test(ika-upgrade-test): out-of-process cross-binary upgrade harness #1727

Open
ycscaly wants to merge 183 commits into dev from feat/ika-upgrade-test

ycscaly commented Jun 5, 2026

Contributor

Summary

New additive crate crates/ika-upgrade-test — an out-of-process test harness that spawns real, separately-compiled ika-validator binaries against an external sui localnet, swaps binaries on validators across epochs, and drives dWallet workloads. Unlike ika-test-cluster (in-process IkaNode, one binary), it can host genuinely different binaries in one committee. No changes to ika-node / ika-swarm.

Implements docs/cross-binary-upgrade-testing*.md; see docs/cross-binary-upgrade-testing-results.md for the full write-up.

Tests (all opt-in via env flags; need real binaries + a workspace-tag sui)

| Test | Status | Proves |
|---|---|---|
| tests/smoke.rs (go/no-go) | ✅ ~396s | 4 out-of-process validators + notifier, network DKG, reach epoch 2 |
| tests/cross_binary.rs | ✅ ~722s | Boot 4 on a v3-only binary, swap all to dev, capability vote advances v3 → v4 |
| tests/workload.rs | ✅ ~415s | Full user DKG → Presign → Sign completes on-chain |

The cross-binary run exercises, out of process: protocol-vote arithmetic, mid-epoch reconfiguration MPC across the swap, mixed-committee wire compat, and on-disk compat (restart on a new binary against the old RocksDB).

Crate layout

  • sui.rs — spawn external sui start --with-faucet --force-regenesis (waits for RPC and faucet).
  • cluster.rs — chain bootstrap via init_ika_on_sui + ValidatorConfigBuilder + a notifier fullnode; NodeConfig → YAML → child; on-chain wait_for_epoch / protocol-version via IkaClient.
  • process.rs — ValidatorProcess: spawn / stop / swap_binary, health via the admin RPC.
  • binary.rs — BinarySpec (path / tag / sha / branch) + a sha-keyed git worktree build cache honoring each commit's pinned toolchain.
  • scenario.rs — imperative DSL (start / wait_for_epoch / stop_and_swap / expect_protocol_version).
  • workload.rs — drives a user dWallet lifecycle by orchestrating the canonical ika CLI; confirms completion on-chain.
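
A rough sketch of how the scenario DSL might read. Only the four method names (start / wait_for_epoch / stop_and_swap / expect_protocol_version) come from the list above; the `Scenario` and `Step` types, the string binary labels, and the epoch number are hypothetical, and this version only records steps instead of spawning processes.

```rust
/// Hypothetical step type: the real harness acts on child processes,
/// this sketch only records what it was asked to do.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Step {
    Start { validators: usize, binary: String },
    WaitForEpoch(u64),
    StopAndSwap { validator: usize, binary: String },
    ExpectProtocolVersion(u64),
}

/// Hypothetical scenario recorder; steps are kept in call order.
#[derive(Debug, Default)]
pub struct Scenario {
    steps: Vec<Step>,
}

impl Scenario {
    pub fn start(&mut self, validators: usize, binary: &str) {
        self.steps.push(Step::Start { validators, binary: binary.to_string() });
    }
    pub fn wait_for_epoch(&mut self, epoch: u64) {
        self.steps.push(Step::WaitForEpoch(epoch));
    }
    pub fn stop_and_swap(&mut self, validator: usize, binary: &str) {
        self.steps.push(Step::StopAndSwap { validator, binary: binary.to_string() });
    }
    pub fn expect_protocol_version(&mut self, version: u64) {
        self.steps.push(Step::ExpectProtocolVersion(version));
    }
    pub fn steps(&self) -> &[Step] {
        &self.steps
    }
}

/// Outline of the cross-binary run: boot 4 validators on the old binary,
/// swap every validator to dev, then expect protocol version 4.
pub fn cross_binary_outline() -> Scenario {
    let mut s = Scenario::default();
    s.start(4, "v3-only");
    s.wait_for_epoch(1); // illustrative epoch number, not taken from the test
    for validator in 0..4 {
        s.stop_and_swap(validator, "dev");
    }
    s.expect_protocol_version(4);
    s
}
```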

Key finding: mainnet-v1.1.8 → dev is not a naive rolling swap

Running the harness with the real mainnet-v1.1.8 ika-node fails at boot for the expected reason, not a harness bug: v1.1.8 links class_groups from dwallet-labs/inkrypto, dev from dwallet-labs/cryptography-private, and v4 publishes the combined ValidatorEncryptionKeysAndProofs where v1.1.8 expects the bare ClassGroupsEncryptionKeyAndProof — the exact incompatibility flagged in validator_initialization_config.rs (⚠️ MAINNET WIRE-FORMAT INCOMPATIBILITY ⚠️). The v1.1.8 node loads its config, connects to Sui, reads the contracts, then fails decoding the on-chain validator record (class groups public key … remaining input). So the harness genuinely runs a different binary and fails on the documented wire-format divergence.

To demonstrate a successful heterogeneous upgrade, the green cross_binary run uses an OLD binary that is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto, differs only in advertised protocol version) — disclosed in the test and results doc.

Notes

  • Builds use --no-default-features to drop enforce-minimum-cpu (panics on < 16-core hosts).
  • The workload uses a dedicated funded user (faucet SUI + IKA transfer) to avoid contention with the notifier; register-encryption-key precedes create; v4 genesis for internal_presign_sessions; long epoch to clear the mid-epoch reconfiguration; sign confirmed via the coordinator's on-chain completed-session count.

🤖 Generated with Claude Code

omersadika and others added 30 commits May 17, 2026 16:16
Foundation for the off-chain validator-metadata read flow. Pure
types and no-op consensus dispatch — no behavior change, so the
acceptance gate `test_network_dkg_full_flow` still passes.

New types in `ika_types::validator_metadata`:
- ValidatorMpcDataAnnouncement / SignedValidatorMpcDataAnnouncement
- HandoffItemKey (sorted enum: NetworkDkgOutput | NetworkReconfigurationOutput | ValidatorMpcData)
- HandoffAttestation with `items: Vec<(HandoffItemKey, [u8;32])>` sorted strictly ascending — plain length-prefixed BCS list, no map-aware bindings needed for non-Rust verifiers
- HandoffSignatureMessage (Ed25519 sig by consensus key, NOT protocol key)
- CertifiedHandoffAttestation (Vec<(AuthorityName, Ed25519Signature)>; Ed25519 doesn't aggregate)
- EpochMpcDataReadySignal

IntentScope: +ValidatorMpcDataAnnouncement, +HandoffAttestation.

ConsensusTransactionKind + Key: 3 new variants + constructors +
key extraction + Debug arms. AuthorityPerEpochStore /
consensus_handler / consensus_validator wire dispatch as no-ops
(actual handlers land in later steps); the per-epoch sender-author
match enforces wire-binding for HandoffSignature and
EpochMpcDataReadySignal (signer == consensus author), and is a
trivial pass for ValidatorMpcDataAnnouncement (the inner BLS sig
authenticates the validator's intent independent of the relayer).

Unit tests cover BCS roundtrip + sort stability + ready-signal
roundtrip.
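
As an illustration of the "sorted strictly ascending" invariant, here is a minimal sketch. The variant names follow the commit message; the `u64` / `u32` payloads and the `is_strictly_ascending` helper are assumptions made for the example, not the real type definitions.

```rust
/// Variant order mirrors the order listed above; with a derived `Ord`,
/// declaration order decides how keys of different variants compare.
/// The payload types are placeholders (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum HandoffItemKey {
    NetworkDkgOutput(u64),
    NetworkReconfigurationOutput(u64),
    ValidatorMpcData(u32),
}

/// True when every key is strictly greater than the one before it,
/// which also rules out duplicate keys.
pub fn is_strictly_ascending(items: &[(HandoffItemKey, [u8; 32])]) -> bool {
    items.windows(2).all(|pair| pair[0].0 < pair[1].0)
}
```

Because the list is a plain `Vec` of pairs, a verifier in another language only needs to decode a length-prefixed sequence and run the same adjacent-pair check.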

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anemo `ValidatorMetadata` service with one method
`GetMpcDataBlob(blob_hash) -> Option<MpcDataBlob>`. Backed by an
`InMemoryBlobStore` (RwLock<HashMap<[u8;32], Vec<u8>>>) implementing
`MpcDataBlobStorage`. Callers hash-verify returned bytes — the
network layer doesn't, and the doc comment on `fetch_blob` says so.
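
A minimal sketch of the store plus the caller-side hash check. The `RwLock<HashMap<..>>` shape comes from the text above; `toy_digest` is a non-cryptographic stand-in so the example needs no external crate, and `insert_unchecked` / `fetch_verified` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

pub type BlobHash = [u8; 32];

/// Stand-in digest (FNV-1a lanes). NOT cryptographic; assumption for the sketch.
pub fn toy_digest(bytes: &[u8]) -> BlobHash {
    let mut out = [0u8; 32];
    for (lane, chunk) in out.chunks_mut(8).enumerate() {
        let mut h: u64 = 0xcbf2_9ce4_8422_2325 ^ lane as u64;
        for b in bytes {
            h ^= u64::from(*b);
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
        chunk.copy_from_slice(&h.to_le_bytes());
    }
    out
}

#[derive(Default)]
pub struct InMemoryBlobStore {
    blobs: RwLock<HashMap<BlobHash, Vec<u8>>>,
}

impl InMemoryBlobStore {
    /// Stores bytes under the given hash without checking them,
    /// the way a store hydrated from untrusted input would.
    pub fn insert_unchecked(&self, hash: BlobHash, bytes: Vec<u8>) {
        self.blobs.write().unwrap().insert(hash, bytes);
    }

    /// Returns whatever is stored; performs no verification.
    pub fn get(&self, hash: &BlobHash) -> Option<Vec<u8>> {
        self.blobs.read().unwrap().get(hash).cloned()
    }
}

/// Caller-side check: accept the bytes only if they hash to the requested key.
pub fn fetch_verified(store: &InMemoryBlobStore, hash: &BlobHash) -> Option<Vec<u8>> {
    store.get(hash).filter(|bytes| toy_digest(bytes) == *hash)
}
```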

`AuthorityPerpetualTables::mpc_artifact_blobs: DBMap<[u8;32], Vec<u8>>`
with insert / get / iter helpers — the cross-restart store. At node
startup `create_p2p_network` iterates that table and hydrates the
in-memory cache before mounting the anemo server, so a restart
keeps serving whatever blobs the validator had persisted.

No producers or consumers wire up yet — those land in subsequent
steps. The endpoint just serves whatever's been inserted (initially
nothing on a fresh node).

Acceptance gate `test_network_dkg_full_flow` passes (142s).
2 new unit tests in ika-network (`in_memory_blob_store_roundtrip`,
`mpc_data_blob_hash_is_deterministic`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Producer side (ika_core::validator_metadata):
- derive_mpc_data_blob(seed) returns the canonical BCS-encoded
  VersionedMPCData::V1 bytes — same encoding the CLI submits on
  chain via set_next_epoch_mpc_data_bytes. Deterministic from
  seed, so off-chain blobs hash-match chain bytes.
- now_ms() for the announcement timestamp (latest-by-timestamp
  rule means later calls win, which is correct after a seed
  rotation).
- sign_validator_mpc_data_announcement(...) builds + BLS-signs the
  announcement ready for consensus.

Consumer side (AuthorityPerEpochStore):
- New per-epoch table validator_mpc_data_announcements:
  DBMap<AuthorityName, SignedValidatorMpcDataAnnouncement>.
- record_validator_mpc_data_announcement verifies the BLS sig
  against self.committee() (current-epoch path only — next-epoch
  joiner path deferred to step 6) and applies the
  latest-by-timestamp rule on insert. Replays and stale duplicates
  are silently dropped.
- get_validator_mpc_data_announcement accessor.
- Consensus dispatch wires the ConsensusTransactionKind::
  ValidatorMpcDataAnnouncement variant through.
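
The latest-by-timestamp insert rule can be sketched as follows. This is a simplified model: signature verification is left out, `u32` stands in for `AuthorityName`, and the `AnnouncementTable` type is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Announcement {
    pub authority: u32, // stand-in for AuthorityName (assumption)
    pub timestamp_ms: u64,
    pub blob_hash: [u8; 32],
}

#[derive(Default)]
pub struct AnnouncementTable {
    latest: HashMap<u32, Announcement>,
}

impl AnnouncementTable {
    /// Stores the announcement only if it is strictly newer than the one
    /// already held for that authority. Returns true when it was stored;
    /// replays and stale duplicates return false and change nothing.
    pub fn record(&mut self, ann: Announcement) -> bool {
        match self.latest.get(&ann.authority) {
            Some(existing) if existing.timestamp_ms >= ann.timestamp_ms => false,
            _ => {
                self.latest.insert(ann.authority, ann);
                true
            }
        }
    }

    pub fn get(&self, authority: u32) -> Option<&Announcement> {
        self.latest.get(&authority)
    }
}
```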

Unit tests in ika-core::validator_metadata:
- derive_mpc_data_blob_is_deterministic
- sign_announcement_verifies_against_signer (covers intent
  scope + epoch binding + tamper detection).

Acceptance gate test_network_dkg_full_flow still passes (143s).
No producers wired up yet — they land in subsequent steps along
with the ready-signal freeze.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two new epoch tables and a producer helper for the freeze step
of the off-chain validator-metadata flow.

`epoch_mpc_data_ready_signals` records, per authority, that this
validator has decided its mpc_data input set is sufficient (`>=
quorum_threshold` announcements observed). The first incoming signal
that crosses quorum triggers `freeze_mpc_data_if_first`, which
idempotently snapshots `validator_mpc_data_announcements` into
`frozen_validator_mpc_data_input_set` — the immutable, content-
addressed view of validator mpc_data used by all downstream
consumers (handoff, reconfig, joiner bootstrap).
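
A small model of the freeze step, under the assumption that quorum is measured as the summed stake of the authorities that have signalled. `EpochTables` and its fields are hypothetical; only `freeze_mpc_data_if_first` and the "snapshot once, then immutable" behavior come from the description above.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type AuthorityName = u32; // stand-in (assumption)
pub type BlobHash = [u8; 32];

pub struct EpochTables {
    stake: BTreeMap<AuthorityName, u64>,
    quorum_threshold: u64,
    announcements: BTreeMap<AuthorityName, BlobHash>,
    ready_signals: BTreeSet<AuthorityName>,
    frozen: Option<BTreeMap<AuthorityName, BlobHash>>,
}

impl EpochTables {
    pub fn new(stake: BTreeMap<AuthorityName, u64>, quorum_threshold: u64) -> Self {
        Self {
            stake,
            quorum_threshold,
            announcements: BTreeMap::new(),
            ready_signals: BTreeSet::new(),
            frozen: None,
        }
    }

    pub fn announce(&mut self, authority: AuthorityName, hash: BlobHash) {
        self.announcements.insert(authority, hash);
    }

    /// Records a ready signal. Returns true only for the call that freezes.
    pub fn record_ready_signal(&mut self, authority: AuthorityName) -> bool {
        if !self.stake.contains_key(&authority) {
            return false; // unknown authorities are ignored in this sketch
        }
        self.ready_signals.insert(authority);
        let ready_stake: u64 = self.ready_signals.iter().map(|a| self.stake[a]).sum();
        ready_stake >= self.quorum_threshold && self.freeze_mpc_data_if_first()
    }

    /// Idempotent: snapshots the announcements once, later calls are no-ops.
    fn freeze_mpc_data_if_first(&mut self) -> bool {
        if self.frozen.is_some() {
            return false;
        }
        self.frozen = Some(self.announcements.clone());
        true
    }

    pub fn frozen(&self) -> Option<&BTreeMap<AuthorityName, BlobHash>> {
        self.frozen.as_ref()
    }
}
```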

The signal payload itself is unauthenticated; authorisation is the
consensus binding (the authority that submitted the transaction).
This is enforced at consensus dispatch in `AuthorityPerEpochStore`.

Producer side: `build_epoch_mpc_data_ready_signal_transaction` wraps
the signal in a `ConsensusTransaction` ready for the consensus
adapter.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.28s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Joining validators (in V_{e+1} but not in V_e) can't submit
directly to consensus because they aren't members of the current
consensus committee. They fan out their signed mpc_data
announcement to every current-committee peer over a new Anemo RPC
`SubmitMpcDataAnnouncement`; one honest relayer is enough to land
the announcement in consensus.

This commit lands the transport only:
- `SubmitMpcDataAnnouncementRequest{Response}` wire types.
- `AnnouncementRelay` trait (impl supplied by the node once epoch
  store + consensus adapter are up).
- `AnnouncementRelayHandle` — an `ArcSwapOption` late-binding
  holder, installed at first epoch start and re-installed across
  epoch boundaries. The Anemo server is constructed at node
  startup before any epoch store exists, so install-after-the-fact
  is needed.
- Anemo server impl that returns `Rejected` while the relay is
  uninstalled (joiners retry) and dispatches to the active relay
  otherwise.
- Client helpers: `submit_announcement_to_peer` (single peer) and
  `submit_announcement_to_committee` (concurrent fan-out).
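
The late-binding holder can be approximated with the standard library alone. This sketch swaps `ArcSwapOption` for `RwLock<Option<Arc<..>>>` to stay dependency-free; the `Accepted` variant, the `relay` method signature, and the `AcceptAll` test relay are assumptions.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, PartialEq, Eq)]
pub enum SubmitResponse {
    Accepted, // hypothetical success variant
    Rejected,
}

pub trait AnnouncementRelay: Send + Sync {
    fn relay(&self, announcement: &[u8]) -> SubmitResponse;
}

/// Created before any relay exists; the relay is installed later.
#[derive(Default)]
pub struct AnnouncementRelayHandle {
    inner: RwLock<Option<Arc<dyn AnnouncementRelay>>>,
}

impl AnnouncementRelayHandle {
    pub fn install(&self, relay: Arc<dyn AnnouncementRelay>) {
        *self.inner.write().unwrap() = Some(relay);
    }

    pub fn clear(&self) {
        *self.inner.write().unwrap() = None;
    }

    /// Server-side entry point: rejects while no relay is installed,
    /// otherwise dispatches to the active relay.
    pub fn handle(&self, announcement: &[u8]) -> SubmitResponse {
        // Clone the Arc so the lock is released before calling the relay.
        let relay = self.inner.read().unwrap().clone();
        match relay {
            Some(active) => active.relay(announcement),
            None => SubmitResponse::Rejected,
        }
    }
}

/// Trivial relay used only to exercise the handle.
pub struct AcceptAll;

impl AnnouncementRelay for AcceptAll {
    fn relay(&self, _announcement: &[u8]) -> SubmitResponse {
        SubmitResponse::Accepted
    }
}
```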

Installation of the actual relay impl (which performs signature
verification against the pending active set) is deferred to the
PendingActiveSet step, since the relay needs that verification
before it can safely submit.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.61s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the placeholder next-epoch branch in
`record_validator_mpc_data_announcement` with real signature
verification gated on a `JoinerPubkeyProvider`.

`JoinerPubkeyProvider::is_registered_joiner(&AuthorityName) -> bool`
is the trait the Sui-backed lookup will implement; a future step
populates it from `validator_set.pending_active_set` plus each
entry's `StakingPool.validator_info`'s next-epoch pubkey. Until
that lands, `joiner_pubkey_provider` is unset and all next-epoch
announcements drop — current-epoch flow is unchanged.

`verify_joiner_announcement` is a pure helper (caller passes
`expected_epoch` and the provider). The per-epoch-store method
calls it and reacts to the four-way verdict
(Accept/UnregisteredJoiner/InvalidSignature/InconsistentEnvelope);
only `Accept` proceeds to the latest-by-timestamp insert rule.
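
One way the four-way verdict could be structured is shown below. The verdict and trait names come from the text; the struct fields, the order of the checks, and the `sig_ok` callback (a placeholder for real BLS verification) are assumptions.

```rust
use std::collections::BTreeSet;

pub type AuthorityName = u32; // stand-in (assumption)

pub struct SignedAnnouncement {
    pub authority: AuthorityName, // authority named in the payload
    pub epoch: u64,
    pub blob_hash: [u8; 32],
    pub signer: AuthorityName, // authority named on the signature envelope
}

#[derive(Debug, PartialEq, Eq)]
pub enum JoinerVerdict {
    Accept,
    UnregisteredJoiner,
    InvalidSignature,
    InconsistentEnvelope,
}

pub trait JoinerPubkeyProvider {
    fn is_registered_joiner(&self, name: &AuthorityName) -> bool;
}

pub struct StaticJoinerPubkeyProvider(pub BTreeSet<AuthorityName>);

impl JoinerPubkeyProvider for StaticJoinerPubkeyProvider {
    fn is_registered_joiner(&self, name: &AuthorityName) -> bool {
        self.0.contains(name)
    }
}

/// Pure helper: the caller supplies the expected epoch, the provider,
/// and a signature check.
pub fn verify_joiner_announcement(
    signed: &SignedAnnouncement,
    expected_epoch: u64,
    provider: &dyn JoinerPubkeyProvider,
    sig_ok: impl Fn(&SignedAnnouncement) -> bool,
) -> JoinerVerdict {
    if signed.epoch != expected_epoch || signed.signer != signed.authority {
        return JoinerVerdict::InconsistentEnvelope;
    }
    if !provider.is_registered_joiner(&signed.authority) {
        return JoinerVerdict::UnregisteredJoiner;
    }
    if !sig_ok(signed) {
        return JoinerVerdict::InvalidSignature;
    }
    JoinerVerdict::Accept
}
```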

The provider is held in an `ArcSwapOption` on
`AuthorityPerEpochStore`, swappable across epoch boundaries via
`install_joiner_pubkey_provider` / `clear_joiner_pubkey_provider`.
`AuthorityName == AuthorityPublicKeyBytes`, so the verifier uses
`signed.auth_sig.authority` as the pubkey directly — the provider
only authorizes *which* names are joinable.

Tests cover Accept, UnregisteredJoiner, InvalidSignature (tampered
blob hash), InconsistentEnvelope (wrong epoch + authority field
mismatch), and `StaticJoinerPubkeyProvider` membership semantics.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 148.28s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lands the canonical, off-chain handoff attestation primitives
behind the next-step record/persist plumbing. These are the
building blocks each validator runs locally at EndOfPublish
(builder + signer) and that every validator runs on incoming
consensus signatures (verifier + aggregator).

- `build_handoff_attestation`: sorts items strictly ascending by
  `HandoffItemKey` (the wire format is a Vec, not a map, so the
  sort defines the canonical bytes every signer commits to);
  rejects duplicate keys.
- `hash_next_committee_pubkey_set`: dedup + sort + BCS-encode +
  Blake2b256 over the next committee's pubkey set. This goes in
  the attestation header, so verifiers can confirm the cert is
  bound to the committee they're handing off to.
- `sign_handoff_attestation`: Ed25519 over
  `bcs(IntentMessage::new(HandoffAttestation, attestation))` —
  signed with the validator's *consensus* key, NOT BLS. (Joiners
  look up signers' consensus pubkeys in the prior committee's
  on-chain validator info.)
- `ConsensusPubkeyProvider` trait + `StaticConsensusPubkeyProvider`
  for the consensus-pubkey lookup, mirroring the joiner-provider
  shape from step 6.
- `verify_handoff_signature` returns a four-way verdict
  (Accept/UnknownSigner/InvalidSignature/AttestationMismatch).
- `HandoffAggregator`: one-shot stake-weighted aggregator that
  emits `CertifiedHandoffAttestation` the first time signers
  cross `committee.quorum_threshold()`. Replacements don't
  double-count; non-committee signers are silently dropped (the
  consensus path also rejects them at the dispatch site, but the
  aggregator is defense-in-depth).
- `verify_certified_handoff_attestation`: standalone re-verify
  against a committee + provider — what joiners run during
  bootstrap on the cert they fetched.
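
A simplified model of the one-shot, stake-weighted aggregator. Signatures are opaque bytes assumed to be verified before they reach it, `u32` stands in for `AuthorityName`, and keeping the first signature when a signer repeats is a choice made for this sketch.

```rust
use std::collections::BTreeMap;

pub type AuthorityName = u32; // stand-in (assumption)
pub type Signature = Vec<u8>; // opaque, already-verified bytes (assumption)

pub struct HandoffAggregator {
    stake: BTreeMap<AuthorityName, u64>,
    quorum_threshold: u64,
    signatures: BTreeMap<AuthorityName, Signature>,
    certified: bool,
}

impl HandoffAggregator {
    pub fn new(stake: BTreeMap<AuthorityName, u64>, quorum_threshold: u64) -> Self {
        Self { stake, quorum_threshold, signatures: BTreeMap::new(), certified: false }
    }

    /// Returns the collected signer list exactly once: on the insert that
    /// first brings the signed stake to the quorum threshold.
    pub fn insert(
        &mut self,
        signer: AuthorityName,
        signature: Signature,
    ) -> Option<Vec<(AuthorityName, Signature)>> {
        if !self.stake.contains_key(&signer) {
            return None; // non-committee signers are dropped
        }
        // One entry per signer, so a repeat never adds stake twice.
        self.signatures.entry(signer).or_insert(signature);
        let signed_stake: u64 = self.signatures.keys().map(|a| self.stake[a]).sum();
        if self.certified || signed_stake < self.quorum_threshold {
            return None;
        }
        self.certified = true;
        Some(self.signatures.iter().map(|(a, s)| (*a, s.clone())).collect())
    }
}
```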

Tests cover sort canonicalization, duplicate-key rejection,
pubkey-set hash invariance under reorder and dedup, sign+verify
round trip with the four verdict outcomes, aggregator quorum
crossing, replacement no-op, non-committee signer no-op, and
end-to-end certify-then-re-verify-with-tampered-sig.

Record / persist / EndOfPublish-trigger wiring land in
follow-on commits; these helpers are isolated and consumed at
those sites.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.26s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the consensus dispatch path for `HandoffSignature` to verify,
persist, and aggregate incoming Ed25519 signatures over the epoch's
handoff attestation.

Per-epoch state on `AuthorityPerEpochStore`:
- `handoff_signatures: DBMap<AuthorityName, Ed25519Signature>` —
  durable record of each verified signer's sig. Replays are
  no-ops via typed-store insert semantics.
- `expected_handoff_attestation: ArcSwapOption<HandoffAttestation>`
  — this validator's locally-computed attestation, installed by
  the producer side once mpc_data is frozen + DKG/reconfig digests
  are known. Until installed, incoming signatures drop silently
  (`AttestationMismatch` is the only possible verdict).
- `consensus_pubkey_provider: ArcSwapOption<...>` — Ed25519 lookup
  for signer pubkeys, populated by the same sui_syncer task that
  feeds the joiner provider.
- `handoff_aggregator: Mutex<Option<HandoffAggregator>>` — in-memory
  stake accumulator. Rebuilt from persisted signatures when the
  expected attestation is (re)installed, so restart replay folds
  prior consensus-ordered signatures back in correctly.

New pure helper in `validator_metadata`:
- `process_handoff_signature` runs `verify_handoff_signature` and,
  on `Accept`, inserts into the aggregator. Returns one of
  `Recorded`, `Certified(cert)`, or `Rejected(verdict)`. Three new
  unit tests cover quorum-crossing, attestation mismatch, and
  unknown-signer paths.

`PartialEq`/`Eq` added to `HandoffSignatureMessage` and
`CertifiedHandoffAttestation` so the record-outcome enum can derive
those traits for tests.

Consensus dispatch: the `HandoffSignature` arm now calls
`record_handoff_signature`. The returned cert (when quorum just
crossed) is intentionally dropped on the floor for now — the
perpetual-persist plumbing (step 7c) hangs off a dedicated drain
task that pulls from the in-memory aggregator. Dropping is safe
because the *next* ordered signature crossing quorum still mints a
cert, and restart-replay rebuilds the aggregator.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.08s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the handoff write path: once `record_handoff_signature`'s
in-memory aggregator crosses quorum, the resulting
`CertifiedHandoffAttestation` is immediately persisted into a
keep-forever perpetual table.

`AuthorityPerpetualTables`:
- New `certified_handoff_attestations: DBMap<EpochId,
  CertifiedHandoffAttestation>` table, keyed by the epoch the
  outgoing committee is handing off *from*.
- `insert_certified_handoff_attestation`,
  `get_certified_handoff_attestation`,
  `iter_certified_handoff_attestations` accessors.

The handoff feedback rule (keep certs forever) is load-bearing
because a joiner pulling history may need to verify the chain back
to whichever cert it has a trusted committee for; skipping any
single epoch's cert would permanently break their ability to
bootstrap.

`AuthorityPerEpochStore` gains
`perpetual_tables_for_handoff: ArcSwapOption<...>` plus
`install_perpetual_tables_for_handoff`. `ika-node` installs the
perpetual handle directly after constructing the epoch store, so
the very first cert produced by consensus lands on disk. When
nothing is installed (e.g. unit tests that don't wire perpetual),
the record path logs at debug level and keeps going — the cert
stays in the in-memory aggregator and joiner-bootstrap consumers
will simply miss it.

The `Certified` arm of `record_handoff_signature` now also
performs the perpetual write, with the persist failure logged
(not propagated) — failing the entire consensus-dispatch path on
a perpetual-DB hiccup would be far worse than a missing cert.

Tests: 3 new perpetual-table unit tests cover insert/get
roundtrip, ordered iteration across epochs, and byte-level
idempotency on identical re-writes.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 141.68s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the producer half of the handoff loop: when this validator
reaches EndOfPublish, the same task that submits its
`EndOfPublish` consensus transaction also builds, installs, signs,
and submits its `HandoffSignatureMessage` for the epoch — exactly
once.

The trigger pipeline:
1. `compute_handoff_items` (pure): combines frozen mpc_data set +
   per-network-key DKG output digests + per-network-key reconfig
   output digests into a sorted Vec<(HandoffItemKey, [u8;32])>.
   Empty inputs are valid (yields an empty list) — important
   because DKG/reconfig digest caching is step 9, and the
   attestation needs to be signable before then.
2. `AuthorityPerEpochStore::build_local_handoff_attestation`:
   reads the frozen set, hashes the supplied next-committee
   pubkey set, calls compute_handoff_items, and builds a
   well-formed attestation.
3. `AuthorityPerEpochStore::build_local_handoff_signature_transaction`:
   installs the attestation locally (so the per-epoch record path
   accepts matching peer signatures), signs it with the consensus
   key, and wraps it in a `ConsensusTransaction`.
4. `EndOfPublishSender` is upgraded to take the consensus keypair
   (Arc) + a `Receiver<Committee>` for the next epoch, plus an
   `AtomicBool` one-shot flag. The handoff submit happens after
   the EndOfPublish submit on the same tick.
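
Step 1 of the pipeline, sketched as a pure function. The key payload types and the three `BTreeMap` inputs are assumptions; the point is only that the three sources are merged into one list and sorted by key, and that empty inputs produce an empty list.

```rust
use std::collections::BTreeMap;

pub type Digest = [u8; 32];

/// Payload types are placeholders (assumption); derived `Ord` gives the sort order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum HandoffItemKey {
    NetworkDkgOutput(u64),
    NetworkReconfigurationOutput(u64),
    ValidatorMpcData(u32),
}

/// Merges the frozen mpc_data set with the per-network-key DKG and
/// reconfiguration digests into one list sorted by key.
pub fn compute_handoff_items(
    frozen_mpc_data: &BTreeMap<u32, Digest>,
    dkg_digests: &BTreeMap<u64, Digest>,
    reconfig_digests: &BTreeMap<u64, Digest>,
) -> Vec<(HandoffItemKey, Digest)> {
    let dkg = dkg_digests
        .iter()
        .map(|(k, d)| (HandoffItemKey::NetworkDkgOutput(*k), *d));
    let reconfig = reconfig_digests
        .iter()
        .map(|(k, d)| (HandoffItemKey::NetworkReconfigurationOutput(*k), *d));
    let mpc_data = frozen_mpc_data
        .iter()
        .map(|(a, d)| (HandoffItemKey::ValidatorMpcData(*a), *d));
    let mut items: Vec<(HandoffItemKey, Digest)> = mpc_data.chain(reconfig).chain(dkg).collect();
    // Sorting makes the output independent of the order the sources were chained in.
    items.sort_by_key(|(key, _)| *key);
    items
}
```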

Determinism across validators: identical inputs → identical
attestation bytes → matching signatures. The frozen set is
already agreed (step 4's quorum freeze); the next-committee
pubkey set is read from chain. Until step 9 populates DKG/reconfig
digests, every validator computes an attestation with those slots
empty — still agreed.

The handoff record path (step 7b) was already wired to consume
these signatures, and the perpetual persist (step 7c) writes the
cert as soon as quorum is reached. With this commit, the cycle
runs end-to-end given an actual EndOfPublish trigger.

Tests: 2 new unit tests cover `compute_handoff_items` sorting +
empty-input semantics, in addition to the existing 19 helper
tests.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 144.29s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the read side that closes the handoff loop: peers can pull a
`CertifiedHandoffAttestation` for any persisted epoch over a new
`ValidatorMetadata::GetCertifiedHandoffAttestation` RPC, and joiners
have a single-hop verification helper that binds the cert to the
specific committee they're trying to join.

Network layer:
- New `GetCertifiedHandoffAttestationRequest { epoch }` wire type.
- New `HandoffCertStorage` trait — the read-only counterpart to
  the perpetual store. Server holds an `Arc<C: HandoffCertStorage>`
  alongside the existing blob store.
- `ValidatorMetadataServer` is now `Server<S, C>`; the
  `build_server(storage, relay, cert_storage)` signature gained the
  `cert_storage` arg.
- Joiner-side `fetch_certified_handoff_attestation(network, peer,
  epoch)` mirrors the existing `fetch_blob`.

Adapter:
- `AuthorityPerpetualTables` implements `HandoffCertStorage` by
  delegating to `get_certified_handoff_attestation` and logging
  (not propagating) a perpetual-read error as `None`. The Anemo
  hot path can't surface a typed error usefully.

ika-node:
- The perpetual handle is now passed into `build_server` so peers
  immediately see every cert that lands on disk (via step 7c's
  perpetual persist). No additional installation needed because
  `AuthorityPerpetualTables` is constructed eagerly at startup.

Joiner bootstrap helper in `ika-core::validator_metadata`:
- `verify_joiner_bootstrap_cert(cert, prior_committee, prior_
  consensus_pubkeys, expected_next_committee_pubkeys)` runs the
  full check: pubkey-set-hash binding (so a malicious peer can't
  hand a real cert for a different committee), then delegates to
  the existing `verify_certified_handoff_attestation` for the
  signature/stake check. One-hop only — joiners verify against
  the *prior* committee, not back to genesis. (Per handoff design
  memo: anchoring trust to the prior committee is sufficient since
  the joiner gets there through earlier hops they either already
  trust or are themselves bootstrapping from a known anchor.)
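
The committee-binding check might look roughly like this. `toy_digest` and the length-prefixed encoding are stand-ins for Blake2b256 and BCS, `Cert` keeps only the one field the check needs, and `verify_cert` is a placeholder for the signature/stake verification; all of these are assumptions.

```rust
use std::collections::BTreeSet;

/// Stand-in digest (FNV-1a lanes). NOT cryptographic; assumption for the sketch.
pub fn toy_digest(bytes: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (lane, chunk) in out.chunks_mut(8).enumerate() {
        let mut h: u64 = 0xcbf2_9ce4_8422_2325 ^ lane as u64;
        for b in bytes {
            h ^= u64::from(*b);
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
        chunk.copy_from_slice(&h.to_le_bytes());
    }
    out
}

/// Dedup + sort + length-prefixed encode + hash, so the result does not
/// depend on input order or on repeated keys.
pub fn hash_pubkey_set(pubkeys: &[Vec<u8>]) -> [u8; 32] {
    let canonical: BTreeSet<&Vec<u8>> = pubkeys.iter().collect();
    let mut encoded = Vec::new();
    encoded.extend_from_slice(&(canonical.len() as u64).to_le_bytes());
    for pk in canonical {
        encoded.extend_from_slice(&(pk.len() as u64).to_le_bytes());
        encoded.extend_from_slice(pk);
    }
    toy_digest(&encoded)
}

/// Only the header field needed for the binding check (hypothetical shape).
pub struct Cert {
    pub next_committee_pubkey_set_hash: [u8; 32],
}

#[derive(Debug, PartialEq, Eq)]
pub enum BootstrapError {
    CommitteeMismatch,
    InvalidCert,
}

/// Binding check first, then the delegated signature/stake check.
pub fn verify_joiner_bootstrap_cert(
    cert: &Cert,
    expected_next_committee_pubkeys: &[Vec<u8>],
    verify_cert: impl Fn(&Cert) -> bool,
) -> Result<(), BootstrapError> {
    if cert.next_committee_pubkey_set_hash != hash_pubkey_set(expected_next_committee_pubkeys) {
        return Err(BootstrapError::CommitteeMismatch);
    }
    if !verify_cert(cert) {
        return Err(BootstrapError::InvalidCert);
    }
    Ok(())
}
```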

Tests: 1 new unit test exercising both the happy path and the
pubkey-set-mismatch refusal.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.31s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Populates the producer-side caches that feed the handoff
attestation's `NetworkDkgOutput` / `NetworkReconfigurationOutput`
items.

`AuthorityPerEpochStoreTrait` gains two methods, called from the
MPC producer at the exact point it builds the consensus output:
- `cache_network_dkg_output(key_id, output_bytes)`
- `cache_network_reconfiguration_output(key_id, output_bytes)`

Concrete `AuthorityPerEpochStore` impl:
- Hashes `output_bytes` with Blake2b256 (the same function
  `mpc_data_blob_hash` uses, so peers can fetch this blob over the
  existing `GetMpcDataBlob` RPC).
- Writes the digest into one of two new per-epoch tables —
  `network_dkg_output_digests` or
  `network_reconfiguration_output_digests` — keyed by
  `dwallet_network_encryption_key_id`.
- Writes the blob bytes into perpetual `mpc_artifact_blobs` (if
  the perpetual handle is installed) so serving across restarts
  works for free.
- All writes are idempotent on byte-identical replays.

`build_local_handoff_attestation` no longer takes the digest maps
as parameters; it reads them straight off the per-epoch store.
`EndOfPublishSender::send_handoff_signature` is updated to match.

Producer hook: `DWalletMPCService::new_dwallet_mpc_output`'s
User/System branch calls the trait methods for the DKG and
reconfig protocols (`!rejected` only — rejected outputs are
empty and shouldn't pollute the cache). Cache failures are
logged, not propagated — they don't fail the consensus output
emit, just degrade peer serveability.

`TestingAuthorityPerEpochStore` gets no-op impls; the integration
test gate doesn't exercise attestation contents so an in-memory
mirror isn't needed.

Tests: 2 new unit tests cover the per-epoch table semantics —
digest roundtrip + replay idempotency, and independence of the
DKG vs reconfig caches when keyed by the same key_id.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 141.54s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the per-network-key counterpart to `EpochMpcDataReadySignal`.
Validators can now signal readiness for a specific network key's
DKG (`NetworkKeyDKGReadySignal { authority, network_key_id,
epoch }`) earlier than the epoch-wide signal, because per-key
readiness is a narrower commitment — the validator only needs the
mpc_data required for *this* key, not all reconfig sessions.

Per-epoch state:
- `network_key_dkg_ready_signals: DBMap<(ObjectID, AuthorityName),
  ()>` — per-key, per-authority votes. Composite key keeps quorums
  scoped: the same authority signaling readiness for two keys
  produces two independent entries.

Record path:
- `record_network_key_dkg_ready_signal` is idempotent on replays.
  Quorum is per-key (sum stake of all authorities that signaled
  for `signal.network_key_id`). The first quorum of *any* signal
  kind — epoch-wide or per-key — calls `freeze_mpc_data_if_first`,
  which is already idempotent on a non-empty frozen set. Per-key
  quorums after that point are still recorded (DKG kickoff per key
  consumes them) but don't re-freeze.
- `has_network_key_dkg_ready_quorum(network_key_id)` exposes the
  per-key quorum state for step 14's session-kickoff gating.
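
To illustrate the composite-key scoping, a minimal sketch: signals are stored as `(network_key_id, authority)` pairs and quorum is summed per key. The integer stand-ins for `ObjectID` and `AuthorityName` and the `NetworkKeyDkgReadySignals` type are assumptions.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type AuthorityName = u32; // stand-in (assumption)
pub type NetworkKeyId = u64; // stand-in for ObjectID (assumption)

pub struct NetworkKeyDkgReadySignals {
    stake: BTreeMap<AuthorityName, u64>,
    quorum_threshold: u64,
    signals: BTreeSet<(NetworkKeyId, AuthorityName)>,
}

impl NetworkKeyDkgReadySignals {
    pub fn new(stake: BTreeMap<AuthorityName, u64>, quorum_threshold: u64) -> Self {
        Self { stake, quorum_threshold, signals: BTreeSet::new() }
    }

    /// Returns true when the signal is new; replays and unknown
    /// authorities return false and leave the table unchanged.
    pub fn record(&mut self, key: NetworkKeyId, authority: AuthorityName) -> bool {
        self.stake.contains_key(&authority) && self.signals.insert((key, authority))
    }

    /// Sums stake over the entries for this key only.
    pub fn has_quorum(&self, key: NetworkKeyId) -> bool {
        let ready_stake: u64 = self
            .signals
            .range((key, AuthorityName::MIN)..=(key, AuthorityName::MAX))
            .map(|(_, authority)| self.stake[authority])
            .sum();
        ready_stake >= self.quorum_threshold
    }
}
```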

Consensus wiring:
- New `ConsensusTransactionKind::NetworkKeyDKGReadySignal` +
  matching `ConsensusTransactionKey` variant.
- `new_network_key_dkg_ready_signal` constructor.
- Sender-authority check at verification time (consensus binding
  is the only authentication; no payload signature).
- Metric label + validator pass-through arms.

Producer helper:
- `build_network_key_dkg_ready_signal_transaction(authority,
  network_key_id, epoch)` wraps a signal in a
  `ConsensusTransaction` ready for submission.

Tests: 1 new unit test on `AuthorityEpochTables`'s
`network_key_dkg_ready_signals` table covers composite-key
scoping + replay idempotency.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.54s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Filters the frozen mpc_data input set down to the union of the
current and next committees before it's consumed by handoff cert
build (and, in step 14, reconfig MPC). Validators who announced
mpc_data this epoch but withdrew before next_committee was
selected get dropped — the cert no longer pins their entries and
reconfig MPC won't allocate work for them.

`compute_effective_reconfig_input_set(frozen, current, next) ->
BTreeMap<AuthorityName, [u8;32]>` is the pure helper; it
intersects with the union of both committee membership lists.
Both committee inputs are `IntoIterator` so callers can hand it
whatever shape they already have (Vec, &[..], `voting_rights`
iter).
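
A sketch of the pure helper as described, with `u32` standing in for `AuthorityName` (assumption). Both committee arguments accept any `IntoIterator`, so a `Vec`, an array, or an iterator all work.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type AuthorityName = u32; // stand-in (assumption)

/// Keeps only the frozen entries whose authority is in the current
/// committee, the next committee, or both.
pub fn compute_effective_reconfig_input_set(
    frozen: &BTreeMap<AuthorityName, [u8; 32]>,
    current: impl IntoIterator<Item = AuthorityName>,
    next: impl IntoIterator<Item = AuthorityName>,
) -> BTreeMap<AuthorityName, [u8; 32]> {
    let members: BTreeSet<AuthorityName> = current.into_iter().chain(next).collect();
    frozen
        .iter()
        .filter(|(name, _)| members.contains(*name))
        .map(|(name, hash)| (*name, *hash))
        .collect()
}
```

With announcers 1 to 4, a current committee of {1, 2}, and a next committee of {2, 3}, the result keeps 1, 2, and 3 and drops 4, the announcer that is in neither committee.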

`AuthorityPerEpochStore::get_effective_reconfig_input_set` reads
the frozen set and the current committee from the store and
delegates to the pure helper. `build_local_handoff_attestation`
now goes through this method instead of pulling `frozen` raw,
so cert items reflect the effective set.

Tests: 2 new unit tests cover the intersection semantics —
a four-author scenario where staying members, joiners, and
withdrawers each take their expected path through the filter, plus
the degenerate case where no announcer overlaps the committees.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.88s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the read-side abstraction that lets the sui_syncer prefer
locally-cached protocol output blobs over the chain blobs when
assembling `DWalletNetworkEncryptionKeyData`. The lightweight
fields (id, current_epoch, dkg_at_epoch, state) always come from
chain — those are authoritative — but the large
`network_dkg_public_output` and
`current_reconfiguration_public_output` blobs can come from the
local content-addressed cache populated by step 9's producer
caching.

New in `ika-core::validator_metadata`:
- `NetworkKeyBlobSource` trait: `network_dkg_output_blob(key_id)`
  and `network_reconfiguration_output_blob(key_id)`, both
  returning `Option<Vec<u8>>`. `None` means "fall back to chain".
- `StaticNetworkKeyBlobSource` — empty-by-default in-memory impl,
  used by tests and as the typed-empty default.
- `fetch_network_key_data_with_off_chain_blobs(chain_data,
  source) -> DWalletNetworkEncryptionKeyData`: takes the chain
  copy, overlays each large blob from `source` if present.
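
The overlay can be sketched like this. The trait and function names follow the list above; the reduced `NetworkKeyData` struct, the `u64` key id, and the public map fields on the static source are simplifications assumed for the example.

```rust
use std::collections::HashMap;

/// Reduced stand-in for DWalletNetworkEncryptionKeyData (assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NetworkKeyData {
    pub id: u64,
    pub current_epoch: u64,
    pub network_dkg_public_output: Vec<u8>,
    pub current_reconfiguration_public_output: Vec<u8>,
}

pub trait NetworkKeyBlobSource {
    /// `None` means "fall back to chain".
    fn network_dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
    fn network_reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
}

/// Empty by default, so every lookup falls back to the chain copy.
#[derive(Default)]
pub struct StaticNetworkKeyBlobSource {
    pub dkg: HashMap<u64, Vec<u8>>,
    pub reconfig: HashMap<u64, Vec<u8>>,
}

impl NetworkKeyBlobSource for StaticNetworkKeyBlobSource {
    fn network_dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.dkg.get(&key_id).cloned()
    }
    fn network_reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.reconfig.get(&key_id).cloned()
    }
}

/// Lightweight fields always come from chain; each large blob is taken
/// from `source` when present and from the chain copy otherwise.
pub fn fetch_network_key_data_with_off_chain_blobs(
    chain_data: NetworkKeyData,
    source: &dyn NetworkKeyBlobSource,
) -> NetworkKeyData {
    let NetworkKeyData {
        id,
        current_epoch,
        network_dkg_public_output,
        current_reconfiguration_public_output,
    } = chain_data;
    NetworkKeyData {
        id,
        current_epoch,
        network_dkg_public_output: source
            .network_dkg_output_blob(id)
            .unwrap_or(network_dkg_public_output),
        current_reconfiguration_public_output: source
            .network_reconfiguration_output_blob(id)
            .unwrap_or(current_reconfiguration_public_output),
    }
}
```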

`AuthorityPerEpochStore` implements `NetworkKeyBlobSource` by
looking up the per-epoch digest cache from step 9
(`network_dkg_output_digests` / `network_reconfiguration_output_
digests`) and then fetching the blob bytes from the perpetual
`mpc_artifact_blobs` store. A missing digest *or* a missing blob
returns `None` — every step of the lookup has the on-chain copy as
its fallback.

Syncer wiring (replacing the chain-read in
`sui_syncer::sync_dwallet_network_keys` with the wrapper) is the
next commit; this one lays the infrastructure.

Tests: 2 new unit tests cover the overlay semantics — partial
overlay (DKG from source, reconfig from chain) and the
all-fall-back case where the source is empty and the merged data
equals the chain copy byte-for-byte.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.76s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the off-chain assembler for the load-bearing
`Committee.class_groups_public_keys_and_proofs` map — the
HashMap reconfig MPC reads to find each committee member's
class-groups encryption key + correctness proof. The new path
decodes blobs locally from the perpetual `mpc_artifact_blobs`
store, keyed by digests pinned in the validators'
`ValidatorMpcDataAnnouncement`s.

The completion gate (per the design memo) is strict:
`assemble_committee_class_groups_off_chain` returns
`OffChainClassGroupsAssembly::Complete(map)` *only* when every
supplied authority resolved successfully — blob found, BCS-
decoded to `VersionedMPCData`, inner bytes decoded to
`ClassGroupsEncryptionKeyAndProof`. Even one missing or
malformed entry forces `Incomplete { missing: [...] }`, and the
caller must fall back to the chain-read path.

Why strict: reconfig MPC reads
`Committee.class_groups_public_keys_and_proofs[authority]`
directly, and a missing/empty entry silently drops that
validator's share without aborting. The existing chain-read path
in `sui_syncer::new_committee` already has this footgun (a
`filter_map` that swallows decode errors per-validator); the
off-chain path *must not* repeat it. Hence: all-or-nothing.
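
The all-or-nothing rule, sketched generically. The enum and function names come from the text; the integer authority type, the blob map, and the `decode` callback (standing in for the two-stage BCS decode) are assumptions.

```rust
use std::collections::{BTreeMap, HashMap};

pub type AuthorityName = u32; // stand-in (assumption)
pub type Digest = [u8; 32];

#[derive(Debug, PartialEq, Eq)]
pub enum OffChainClassGroupsAssembly<K> {
    Complete(HashMap<AuthorityName, K>),
    Incomplete { missing: Vec<AuthorityName> },
}

/// Resolves every authority's pinned digest to a decoded key. A single
/// missing or undecodable blob makes the whole result `Incomplete`, so
/// no entry can be dropped silently.
pub fn assemble_committee_class_groups_off_chain<K>(
    pinned: &BTreeMap<AuthorityName, Digest>,
    blobs: &HashMap<Digest, Vec<u8>>,
    decode: impl Fn(&[u8]) -> Option<K>,
) -> OffChainClassGroupsAssembly<K> {
    let mut map = HashMap::new();
    let mut missing = Vec::new();
    for (authority, digest) in pinned {
        match blobs.get(digest).and_then(|bytes| decode(bytes.as_slice())) {
            Some(key) => {
                map.insert(*authority, key);
            }
            None => missing.push(*authority),
        }
    }
    if missing.is_empty() {
        OffChainClassGroupsAssembly::Complete(map)
    } else {
        OffChainClassGroupsAssembly::Incomplete { missing }
    }
}
```

Contrast this with a `filter_map` over the same inputs, which would return a smaller map and give the caller no signal that an authority was skipped.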

Wiring `sui_syncer::new_committee` to try off-chain first and
fall back on `Incomplete` is the next commit; this commit lands
the pure assembler.

Tests: 3 new unit tests cover (a) the happy path — two seeded
blobs round-trip through `derive_mpc_data_blob` →
`mpc_data_blob_hash` → an in-memory store → assembly back into
the map; (b) missing-blob aborts with the missing authority
listed; (c) corrupt-blob (bytes don't decode as
`VersionedMPCData`) also aborts.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.26s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DKG and reconfig sessions now wait on the off-chain mpc_data
freeze before instantiating. Honest validators that observe the
chain event before the consensus-side freeze quorum lands park
the request and retry on every subsequent batch cycle until the
gate opens.

Gate conditions, evaluated against the per-epoch store:
- `NetworkEncryptionKeyDkg(key_id)` requires
  `is_mpc_data_frozen() && has_network_key_dkg_ready_quorum(key_id)`.
  Per-key quorum makes a stronger commitment than the epoch-wide
  signal: it certifies that this *specific* key has enough peers
  ready to actually participate.
- `NetworkEncryptionKeyReconfiguration(_)` requires only
  `is_mpc_data_frozen()`. Reconfig sweeps every key the validator
  knows about; a per-key gate would deadlock if the per-key
  quorum needed reconfig output for kickoff.
- Everything else (user DKG, presign, sign, etc.) is unaffected.
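
The gate conditions above can be sketched as a single match. This is an illustration only: the `Request` enum, the `EpochStoreView` trait, and `FakeStore` are hypothetical, and the real trait methods return `Result` values rather than plain `bool`.

```rust
/// Illustrative request kinds; the real request type has more variants.
enum Request {
    NetworkEncryptionKeyDkg(u64),
    NetworkEncryptionKeyReconfiguration(u64),
    Other,
}

/// Minimal view of the per-epoch store that the gate needs (assumed shape).
trait EpochStoreView {
    fn is_mpc_data_frozen(&self) -> bool;
    fn has_network_key_dkg_ready_quorum(&self, key_id: u64) -> bool;
}

/// Returns true when the request may be instantiated now.
fn gate_passes(store: &dyn EpochStoreView, request: &Request) -> bool {
    match request {
        // DKG needs the epoch-wide freeze and the per-key quorum.
        Request::NetworkEncryptionKeyDkg(key_id) => {
            store.is_mpc_data_frozen() && store.has_network_key_dkg_ready_quorum(*key_id)
        }
        // Reconfiguration needs only the epoch-wide freeze.
        Request::NetworkEncryptionKeyReconfiguration(_) => store.is_mpc_data_frozen(),
        // User DKG, presign, sign, and so on are not gated.
        Request::Other => true,
    }
}

/// In-memory store used only to exercise the gate.
struct FakeStore {
    frozen: bool,
    ready_keys: Vec<u64>,
}

impl EpochStoreView for FakeStore {
    fn is_mpc_data_frozen(&self) -> bool {
        self.frozen
    }
    fn has_network_key_dkg_ready_quorum(&self, key_id: u64) -> bool {
        self.ready_keys.contains(&key_id)
    }
}
```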

`AuthorityPerEpochStoreTrait` gains the two query methods
`is_mpc_data_frozen` and `has_network_key_dkg_ready_quorum`,
implemented concretely against `frozen_validator_mpc_data_input_set`
and `network_key_dkg_ready_signals` respectively. The previously
inherent-only `has_network_key_dkg_ready_quorum` is gone — it's
now exclusively a trait method.

`TestingAuthorityPerEpochStore`'s impls return `Ok(true)` for
both: integration tests don't drive the freeze flow end-to-end
and would otherwise deadlock at the gate. Production builds use
the real store where these reflect actual consensus-observed
state.

In the manager, a new `requests_pending_for_frozen_mpc_data:
Vec<DWalletSessionRequest>` queue mirrors the existing pending
queues. Drained at the top of every `handle_mpc_request_batch`
by re-running each request through `handle_mpc_request`. Requests
that don't pass get re-queued; those that do proceed through the
existing kickoff path.
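
A rough sketch of the park-and-retry queue, assuming request ids (`u32`) in place of `DWalletSessionRequest` and a single `frozen` flag in place of the real gate check. The `Manager` type here is hypothetical.

```rust
/// Hypothetical manager; field names are illustrative.
struct Manager {
    pending_for_frozen_mpc_data: Vec<u32>,
    started: Vec<u32>,
    frozen: bool,
}

impl Manager {
    /// Starts the request if the gate is open, otherwise parks it.
    fn handle_request(&mut self, request: u32) {
        if self.frozen {
            self.started.push(request);
        } else {
            self.pending_for_frozen_mpc_data.push(request);
        }
    }

    /// Re-runs every parked request first, then handles the new batch.
    fn handle_batch(&mut self, batch: Vec<u32>) {
        // Take the queue so requests that fail the gate again can be re-queued.
        let parked = std::mem::take(&mut self.pending_for_frozen_mpc_data);
        for request in parked.into_iter().chain(batch) {
            self.handle_request(request);
        }
    }
}
```

Draining with `std::mem::take` keeps the retry order stable: parked requests are always reconsidered before newly arrived ones.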

Made `DWalletMPCManager.epoch_store` `pub(crate)` so the gate
check in `mpc_session.rs` can reach it.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 144.14s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the producer-side task without which the off-chain freeze
quorum can never be reached, leaving step 14's kickoff gate
permanently closed and stalling network DKG / reconfig.

The new `MpcDataAnnouncementSender` (sibling of
`EndOfPublishSender` under `sui_connector`) runs once per epoch
per validator and:
1. Derives the canonical class-groups `mpc_data` blob from the
   validator's `RootSeed` (via `derive_mpc_data_blob` — identical
   bytes to what the CLI submits on chain).
2. Persists the blob into perpetual `mpc_artifact_blobs` so
   peers can fetch it by digest over the existing
   `GetMpcDataBlob` RPC.
3. Signs and submits a `ValidatorMpcDataAnnouncement` over
   consensus. Submission is idempotent — replays use the latest-
   by-timestamp rule.
4. After its own announcement is in, submits an
   `EpochMpcDataReadySignal` — one of two signal types whose
   quorum drives `freeze_mpc_data_if_first`.
5. Submits `NetworkKeyDKGReadySignal` for every known network
   key (deduped via a `HashSet`).

Each of (3), (4), (5) is gated by its own one-shot flag plus
ack-on-success, so a transient consensus-adapter failure causes
a retry on the next tick (every 2s) rather than blowing up the
task.
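
The one-shot flag with ack-on-success could look roughly like this. `OneShot` is a hypothetical helper, not a type from the crate; the real task drives it from a 2s tick and submits through the consensus adapter.

```rust
/// A step that should succeed exactly once per epoch (hypothetical helper).
struct OneShot {
    done: bool,
}

impl OneShot {
    fn new() -> Self {
        Self { done: false }
    }

    /// Runs `submit` unless the step was already acknowledged.
    /// The flag flips only on success, so a failure is retried on the next tick.
    fn tick<E>(&mut self, submit: impl FnOnce() -> Result<(), E>) -> Result<(), E> {
        if self.done {
            return Ok(());
        }
        submit()?;
        self.done = true;
        Ok(())
    }
}
```

A caller would hold one `OneShot` per step (announcement, epoch-ready signal, per-key signal) and call `tick` on each timer fire, logging the error instead of returning it.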

Step-14 gate softened to match the design memo's "first quorum
of either signal type freezes mpc_data": DKG kickoff now
requires only `is_mpc_data_frozen()`, same as reconfig. The
per-key signal stays as an alternate freeze trigger but is no
longer a separate hard requirement. The sui_syncer skips
`AwaitingNetworkDKG` keys in the network-keys snapshot, so the
producer task can't observe a fresh DKG-target key to signal for
until *after* DKG completes, and a hard per-key requirement
would deadlock.

Wired from `ika-node::monitor_reconfiguration` alongside
`EndOfPublishSender`. `AuthorityState::perpetual_tables()` added
to expose the perpetual handle without making the field public.

The aborted-on-epoch-end pattern follows
`end_of_publish_sender_handle`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.64s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lights up step 6's joiner verify path by installing a
`StaticJoinerPubkeyProvider` on the current epoch store, sourced
from the next-epoch committee snapshot already kept live by
`sui_syncer::sync_next_committee` and exposed via
`next_epoch_committee_receiver`. Without this, every next-epoch
(joiner) `ValidatorMpcDataAnnouncement` drops silently because the
provider field is `None` by default.

The new per-epoch `JoinerPubkeyProviderUpdater` task watches the
receiver, computes the joiner set as `V_{e+1}.voting_rights`'s
authority names, and calls
`AuthorityPerEpochStore::install_joiner_pubkey_provider`. Since
`AuthorityName == AuthorityPublicKeyBytes`, the BLS sig verify in
`verify_joiner_announcement` runs against the announcer's claimed
authority directly — no separate pubkey lookup needed.

Idempotent: `last_installed` cache short-circuits re-installation
when the underlying set is byte-identical to the last one we
installed.
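
A small sketch of the idempotent install, assuming the joiner set is a `BTreeSet` of raw public-key bytes. The `Updater` struct and its `installs` counter are illustrative stand-ins for the real task and for the call to `install_joiner_pubkey_provider`.

```rust
use std::collections::BTreeSet;

/// Hypothetical updater state.
struct Updater {
    last_installed: Option<BTreeSet<Vec<u8>>>,
    installs: usize,
}

impl Updater {
    /// Installs `joiners` unless it equals the previously installed set.
    /// Returns true when an install actually happened.
    fn maybe_install(&mut self, joiners: BTreeSet<Vec<u8>>) -> bool {
        if self.last_installed.as_ref() == Some(&joiners) {
            return false; // same set as last time, nothing to do
        }
        self.installs += 1; // stand-in for the real provider install
        self.last_installed = Some(joiners);
        true
    }
}
```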

This is a *simplification* of the design memo's "verify against
PendingActiveSet" prescription: we wait until V_{e+1} is selected
on chain instead of reading `PendingActiveSet` directly. Trade-off
— joiners can't announce earlier than V_{e+1} selection, but
reading the `ExtendedField` for PendingActiveSet would require a
new Sui dynamic-field plumbing path that isn't justified for v1.
Early-announce can be added later if join-latency becomes a real
concern.

Spawned alongside the producer task in
`monitor_reconfiguration`; aborted on epoch end via the same
pattern as `end_of_publish_sender_handle`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 271.18s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the verify side of step 7's handoff loop. Without this, the
`ConsensusPubkeyProvider` field stays `None` and every incoming
`HandoffSignatureMessage` drops as `UnknownSigner` — meaning no
peer's signature ever counts toward the aggregator's quorum and the
cert never gets minted.

The new `ConsensusPubkeyProviderUpdater` task fetches the current
committee's `StakingPool.validator_info.consensus_pubkey_bytes`
directly via `sui_client.get_system_inner()` →
`active_committee.members` →
`get_validators_info_by_ids` → `verify().consensus_pubkey`. The
result is mapped `AuthorityName -> Ed25519PublicKey` and installed
as a `StaticConsensusPubkeyProvider` on the per-epoch store.

Cadence: 15s (consensus pubkey is fixed at validator registration
and shouldn't change mid-epoch). Idempotent re-install via a
base64-serialized cache key on the last installed map.

Sources the system inner directly rather than plumbing
`system_object_receiver` out of `SuiSyncer` — one extra RPC every
15s is cheaper than the receiver-broadcast plumbing.

Wired in `monitor_reconfiguration` alongside the
joiner-pubkey-provider updater and the producer task; aborted on
epoch end via the same pattern as `end_of_publish_sender_handle`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 209.13s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires step 12's overlay into the chain-read path. The syncer's
`sync_dwallet_network_keys` task now applies
`fetch_network_key_data_with_off_chain_blobs` to every chain copy
before sending it on the watch channel, so consumers see locally-
cached DKG / reconfig output blobs (populated by step 9's
producer cache) instead of fetching them from chain on every
re-read.

Plumbing:
- `SuiConnectorService` gains
  `network_key_blob_source: Arc<ArcSwapOption<Box<dyn
  NetworkKeyBlobSource>>>` plus an
  `install_network_key_blob_source` method.
- The handle is created (empty) at service construction and
  passed by clone into the syncer task, where
  `sync_dwallet_network_keys` reads it on each fetch tick.
- New adapter `EpochStoreBlobSource` wraps
  `Weak<AuthorityPerEpochStore>` so the long-lived service can
  hold a per-epoch reference; the weak upgrade returns `None`
  cleanly when the epoch ends, which makes the overlay fall back
  to the chain blob via `unwrap_or` on each field.
- `ika-node::monitor_reconfiguration` calls
  `sui_connector_service.install_network_key_blob_source(...)`
  once per epoch with a fresh `EpochStoreBlobSource` pointing at
  the new `cur_epoch_store`. Each install atomically replaces the
  previous epoch's source.

The lightweight metadata (id, current_epoch, dkg_at_epoch, state)
always comes from chain — only the two large output blobs may be
overlaid. When no source is installed, behavior is unchanged
byte-for-byte.
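
The weak-reference overlay can be sketched as follows. The struct fields and the `overlay` helper are assumptions for illustration; only the pattern (a `Weak` upgrade that falls back to the chain blob through `unwrap_or`) comes from the description above.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

/// Hypothetical per-epoch store caching output blobs by key id.
struct EpochStore {
    cached_outputs: HashMap<u64, Vec<u8>>,
}

/// Adapter held by the long-lived service; it must not keep the epoch alive.
struct EpochStoreBlobSource {
    store: Weak<EpochStore>,
}

impl EpochStoreBlobSource {
    fn new(store: &Arc<EpochStore>) -> Self {
        Self { store: Arc::downgrade(store) }
    }

    /// Returns the locally cached blob, or `None` if absent or the epoch ended.
    fn cached_output(&self, key_id: u64) -> Option<Vec<u8>> {
        let store = self.store.upgrade()?; // None once the epoch store is dropped
        store.cached_outputs.get(&key_id).cloned()
    }
}

/// Prefers the local blob and falls back to the chain copy.
fn overlay(source: Option<&EpochStoreBlobSource>, key_id: u64, chain_blob: Vec<u8>) -> Vec<u8> {
    source
        .and_then(|s| s.cached_output(key_id))
        .unwrap_or(chain_blob)
}
```

With no source installed (`None`), the function returns the chain blob untouched, which matches the unchanged-behavior claim.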

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 202.94s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires step 13's pure assembler (`assemble_committee_class_groups_off_chain`)
into the next-committee construction path. When the off-chain set
covers every committee member, the resulting class-groups
public-keys-and-proofs map comes straight from validators' own
`mpc_data` announcements + the perpetual blob store instead of
refetching from chain. `Incomplete` paths transparently fall
through to the existing `get_mpc_data_from_validators_pool` read.

New abstractions in `validator_metadata`:
- `OffChainCommitteeClassGroupsSource` trait — single method
  `try_assemble_class_groups(&[AuthorityName]) ->
  OffChainClassGroupsAssembly`.
- `EpochStoreClassGroupsSource` adapter holds
  `Weak<AuthorityPerEpochStore>` (for the per-authority
  announcement digest lookup) + `Arc<AuthorityPerpetualTables>`
  (for the digest→bytes blob lookup), and delegates to the pure
  assembler. Returns `Incomplete` cleanly when the weak upgrade
  fails (epoch ended).

Plumbing:
- `SuiConnectorService` gains a second
  `Arc<ArcSwapOption<Box<dyn OffChainCommitteeClassGroupsSource>>>`
  handle with a matching `install_class_groups_source` setter.
- The handle is passed by clone into `SuiSyncer::run` and on to
  `sync_next_committee` → `new_committee`, where the off-chain
  attempt happens before the chain read.
- `ika-node::monitor_reconfiguration` installs a fresh
  `EpochStoreClassGroupsSource` once per epoch right next to the
  blob-source install. Each install atomically replaces the
  previous epoch's source.

Strict-gate rationale preserved: `new_committee` only short-
circuits to the off-chain map on `Complete`. Any missing
authority — joiner whose announcement hasn't been verified yet,
blob not yet replicated, decode failure — falls through to chain,
which is the only safe option since the load-bearing rule says
reconfig MPC silently drops validators with no class-groups
entry.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 265.04s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the consumer side of step 5. The Anemo
`SubmitMpcDataAnnouncement` handler had been returning
`Rejected{"relay not installed"}` for every joiner submission;
this commit installs a concrete relay per epoch so the RPC
actually forwards joiner announcements into consensus.

The relay (`ConsensusBackedAnnouncementRelay` in
`sui_connector::announcement_relay`) runs three steps:
1. Cheap envelope checks — refuses unless
   `announcement.epoch == next_epoch`, since current-epoch
   announcements come from members who can submit themselves
   directly.
2. Joiner verify via the pure
   `validator_metadata::verify_joiner_announcement` against the
   per-epoch store's installed `JoinerPubkeyProvider` (populated
   by the joiner-provider syncer from step 6). Rejection here
   stops a malicious peer from using us as a spam pipe.
3. Wraps in `ConsensusTransaction::new_validator_mpc_data_announcement`
   and submits via the consensus adapter.
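
The three relay steps can be sketched as below. Everything here is simplified and hypothetical: authorities are `u32` ids, `signature_ok` stands in for the BLS check done by `verify_joiner_announcement`, and pushing to `submitted` stands in for the consensus submission.

```rust
use std::collections::HashSet;

/// Illustrative announcement; `signature_ok` replaces real signature verification.
struct Announcement {
    epoch: u64,
    authority: u32,
    signature_ok: bool,
}

#[derive(Debug, PartialEq)]
enum RelayOutcome {
    Accepted,
    Rejected(&'static str),
}

/// Hypothetical relay state for one epoch.
struct Relay {
    next_epoch: u64,
    joiners: Option<HashSet<u32>>, // None until a provider is installed
    submitted: Vec<u32>,
}

impl Relay {
    fn relay(&mut self, announcement: &Announcement) -> RelayOutcome {
        // 1. Cheap envelope check: only next-epoch announcements are relayed.
        if announcement.epoch != self.next_epoch {
            return RelayOutcome::Rejected("wrong epoch");
        }
        // 2. Joiner verify against the installed provider.
        let Some(joiners) = &self.joiners else {
            return RelayOutcome::Rejected("no joiner provider installed");
        };
        if !joiners.contains(&announcement.authority) || !announcement.signature_ok {
            return RelayOutcome::Rejected("joiner verification failed");
        }
        // 3. Forward into consensus (modeled as a push).
        self.submitted.push(announcement.authority);
        RelayOutcome::Accepted
    }
}
```

Running the cheap check first means an off-epoch message is rejected before any signature work is spent on it.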

Plumbing:
- `P2pComponents` gains a `mpc_announcement_relay` field
  (`Arc<AnnouncementRelayHandle>`) so the long-lived handle the
  Anemo server already holds is also reachable from
  `monitor_reconfiguration`.
- `IkaNode` stashes the same handle so the per-epoch install
  loop can swap relays without re-touching the network layer.
- New `AuthorityPerEpochStore::joiner_pubkey_provider()` getter
  exposes the installed provider for the relay's verify step
  (mirrors the existing install/clear pair).

Install point: alongside the other per-epoch installs in
`monitor_reconfiguration`. Each epoch's relay holds
`Weak<AuthorityPerEpochStore>` so it naturally fails closed when
the epoch ends (returns "epoch ended" until the new epoch's
relay replaces it).

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 247.16s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reorganizes the four files that have no Sui RPC dependency and
shouldn't have been under `sui_connector/`. They all just hold a
`Weak<AuthorityPerEpochStore>` + an `Arc<dyn SubmitToConsensus>`
and run as per-epoch background tasks that emit
`ConsensusTransaction`s; that's a different responsibility from
`sui_connector/` (which talks to Sui RPC).

Moved (identical bytes):
- `sui_connector/end_of_publish_sender.rs` →
  `epoch_tasks/end_of_publish_sender.rs`
- `sui_connector/mpc_data_announcement_sender.rs` →
  `epoch_tasks/mpc_data_announcement_sender.rs`
- `sui_connector/joiner_pubkey_provider_updater.rs` →
  `epoch_tasks/joiner_pubkey_provider_updater.rs`
- `sui_connector/announcement_relay.rs` →
  `epoch_tasks/announcement_relay.rs`

Kept in `sui_connector/`:
- `consensus_pubkey_provider_updater.rs` — actually calls
  `sui_client.get_system_inner()` + `get_validators_info_by_ids`,
  so it belongs with the Sui-side updaters.

The four moved files use only `crate::` paths internally, so they
need no import edits; the only external rename is in
`ika-node/src/lib.rs` (s/sui_connector/epoch_tasks/ on four
call sites).

Module layout follows the CLAUDE.md `xxx.rs` convention:
new `crates/ika-core/src/epoch_tasks.rs` declares the four
submodules, files live in `epoch_tasks/`. No `mod.rs`.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 144.80s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three structural changes so the handoff loop is generic and not
phrased as a validator-metadata feature:

1) Types extracted to `ika-types::handoff`.
   `HandoffItemKey`, `HandoffAttestation`,
   `HandoffSignatureMessage`, and `CertifiedHandoffAttestation`
   move out of `validator_metadata.rs`. `validator_metadata.rs`
   keeps only the four validator-specific types
   (`ValidatorMpcDataAnnouncement`,
   `SignedValidatorMpcDataAnnouncement`,
   `EpochMpcDataReadySignal`, `NetworkKeyDKGReadySignal`).
   Cross-crate import sites updated.

2) `HandoffSignatureSender` extracted from `EndOfPublishSender`.
   The latter shrinks back to "submit EndOfPublish on the local
   trigger" and nothing else. The new sender lives in
   `epoch_tasks/handoff_signature_sender.rs` and runs on the same
   `end_of_publish_receiver` independently. ika-node spawns both
   side-by-side and aborts both on epoch end.

3) `HandoffItemsBuilder` trait + concrete
   `MpcDataHandoffItemsBuilder`. Item contributors plug in via the
   trait; `AuthorityPerEpochStore::build_local_handoff_attestation`
   now takes `&[Arc<dyn HandoffItemsBuilder>]` and folds each
   contribution into the attestation. Today only the MPC-data
   builder is registered (via `default_handoff_items_builders`);
   new features (NOA, sui-state pinning, etc.) can append their
   own builder without touching the producer or aggregator.
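
A sketch of the builder-folding idea in point 3, with assumed shapes: the attestation is modeled as a map from a string key to a 4-byte digest, and `contribute` is a hypothetical method name. `ExtraItems` is an invented second contributor that shows how a new feature would plug in.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Illustrative attestation: item key to digest (the real type is richer).
type Attestation = BTreeMap<String, [u8; 4]>;

/// Each feature contributes its own items through this trait.
trait HandoffItemsBuilder {
    fn contribute(&self, attestation: &mut Attestation);
}

/// Stand-in for the MPC-data contributor.
struct MpcDataItems;

impl HandoffItemsBuilder for MpcDataItems {
    fn contribute(&self, attestation: &mut Attestation) {
        attestation.insert("mpc_data/validator-0".to_string(), [1, 2, 3, 4]);
    }
}

/// Hypothetical second contributor, added without touching the fold below.
struct ExtraItems;

impl HandoffItemsBuilder for ExtraItems {
    fn contribute(&self, attestation: &mut Attestation) {
        attestation.insert("extra/item".to_string(), [9, 9, 9, 9]);
    }
}

/// Folds every registered builder's contribution into one attestation.
fn build_local_handoff_attestation(builders: &[Arc<dyn HandoffItemsBuilder>]) -> Attestation {
    let mut attestation = Attestation::new();
    for builder in builders {
        builder.contribute(&mut attestation);
    }
    attestation
}
```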

`HandoffItemKey` stays a typed enum for now — moving to opaque
byte keys was the fourth level I called out and explicitly
deferred. Adding a new item kind still requires a variant bump,
which is the right trade-off while the variant count is small.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 295.42s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The module name `validator_metadata` was misleading — it bundled
three orthogonal P2P endpoints that have nothing to do with
"validator metadata" in the dictionary sense. Rename to
`mpc_artifacts` and split into purpose-named submodules:

- `mpc_artifacts/blob_store.rs` — content-addressed `mpc_data`
  blob storage (`MpcDataBlobStorage`, `InMemoryBlobStore`,
  `mpc_data_blob_hash`, `GetMpcDataBlobRequest`, `MpcDataBlob`,
  `fetch_blob`).
- `mpc_artifacts/announcement_relay.rs` — joiner announcement
  forwarding (`AnnouncementRelay`, `AnnouncementRelayHandle`,
  `SubmitMpcDataAnnouncement{Request,Response}`,
  `submit_announcement_to_peer`,
  `submit_announcement_to_committee`).
- `mpc_artifacts/handoff_cert.rs` — handoff cert retrieval
  (`HandoffCertStorage`, `GetCertifiedHandoffAttestationRequest`,
  `fetch_certified_handoff_attestation`).
- `mpc_artifacts/server.rs` — Anemo `ValidatorMetadata` impl,
  unchanged behavior (moved + import paths fixed).
- `mpc_artifacts.rs` — top-level module: `mod generated`,
  submodule declarations, re-exports of every public surface so
  external callers still write `ika_network::mpc_artifacts::X`
  without caring which submodule X lives in, and the public
  `build_server` constructor.

Anemo service wire name stays `ValidatorMetadata` (and the
codegen include stays `ika.ValidatorMetadata.rs`) — the
rename is internal-only, no protocol break. Tests for each
submodule moved next to their code (blob_store + relay tests).

External rename: `ika_network::validator_metadata` →
`ika_network::mpc_artifacts` across ika-core, ika-node, ika-types
inline paths, and ika-network's own build.rs request_type /
response_type paths.

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 265.88s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a single `off_chain_validator_metadata` feature flag and
bumps `MAX_PROTOCOL_VERSION` from 4 to 5; the flag flips on at v5.
All off-chain pipeline hooks now check this flag and fall back to
legacy chain-only behavior when false. The Sui-style protocol-
version advance means every validator switches together at the
exact consensus round the network advances to v5 — no mixed-
version freeze-quorum stalls, no asymmetric blob caches, no
divergent handoff attestations.

Six gates, all failing closed to legacy:

1. Producer tasks self-exit on `run()` when the flag is false:
   `MpcDataAnnouncementSender`, `HandoffSignatureSender`,
   `JoinerPubkeyProviderUpdater`,
   `ConsensusPubkeyProviderUpdater`. Each reads
   `epoch_store.protocol_config().off_chain_validator_metadata_enabled()`
   once at task start.

2. ika-node `monitor_reconfiguration` reads the flag once per
   epoch and skips spawning the four tasks, the relay install,
   and the two `SuiConnectorService` source installs
   (`install_network_key_blob_source`,
   `install_class_groups_source`) when off — saves the spawn
   churn even though the tasks self-gate. `EndOfPublishSender`
   stays unconditional since it's core-protocol.

3. Consumer record paths bail early when the flag is false —
   defensive, so a stray new-kind `ConsensusTransaction` from a
   peer can't allocate state:
   `record_validator_mpc_data_announcement`,
   `record_epoch_mpc_data_ready_signal`,
   `record_network_key_dkg_ready_signal`,
   `record_handoff_signature`.

4. Step-14 kickoff gate `off_chain_gate_passes` evaluates to
   `true` (legacy behavior) when the flag is off. Otherwise
   gates on `is_mpc_data_frozen()`. New trait method
   `off_chain_validator_metadata_enabled` on
   `AuthorityPerEpochStoreTrait` so the gate site can reach the
   flag through the trait object. `TestingAuthorityPerEpochStore`
   returns `true` to preserve existing integration-test behavior.

5. Step-9 producer cache hook in
   `DWalletMPCService::new_dwallet_mpc_output` skips when the
   flag is off — leaves the digest tables empty so the syncer
   overlay path naturally falls through to chain-only reads.

6. Syncer overlays
   (`sync_dwallet_network_keys`, `new_committee`) don't need
   explicit flag checks: when the flag is off, ika-node skips
   `install_*_source`, the source handles stay None inside
   `SuiConnectorService`, and the existing source-handle checks
   fall through to chain.
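
Gate 4 reduces to a one-line predicate. The sketch below takes the protocol version directly instead of going through the trait object, and the version-5 threshold reflects this commit's description only.

```rust
/// ASSUMPTION: the flag turns on at protocol version 5, as this commit states.
fn off_chain_validator_metadata_enabled(protocol_version: u64) -> bool {
    protocol_version >= 5
}

/// Kickoff gate: always passes (legacy behavior) while the flag is off,
/// otherwise waits for the mpc_data freeze.
fn off_chain_gate_passes(protocol_version: u64, is_mpc_data_frozen: bool) -> bool {
    !off_chain_validator_metadata_enabled(protocol_version) || is_mpc_data_frozen
}
```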

Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 313.64s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings in PR #1 (cleanup, ika-benchmark removal), PR #2 (bootstrap library),
PR #3 (ika-test-cluster), and the Inkrypto cryptography-private bump
(post-PR-#1707 `ValidatorEncryptionKeysAndProofs` shape:
class-groups + per-curve PVSS HPKE).

Merge resolutions:

* `authority_per_epoch_store.rs`: take origin/dev's tuple key
  `DBMap<(SessionIdentifier, u16), AssignedPresign>` for
  `assigned_presigns_schnorrkel_substrate` (PR #1707 fix) AND keep the
  seven off-chain metadata fields from this branch.

* `pnpm-lock.yaml`: keep upstream `sdk/signature-mpc-wasm/pkg: {}`
  entry; the stale stashed `sdk/ows/...` entries are already removed.

* `protocol-config/lib.rs`: keep `MAX_PROTOCOL_VERSION = 4`. Merge
  `network_encryption_key_version = Some(3)` and
  `reconfiguration_message_version = Some(3)` into the v4 arm so the
  Inkrypto crypto activates at the current MAX. The v5 arm
  (`noa_checkpoints = true`) is commented out as a forward-looking
  reference. Rewrote the version-history comment with one line per
  version. User's manual `internal_presign_sessions = false` at v4
  preserved.

* Off-chain pipeline PVSS extension: the Inkrypto bump expanded
  `Committee::new` with three new PVSS HashMaps (secp256k1, secp256r1,
  ristretto). Extended `OffChainCommitteeClassGroupsSource` to
  assemble all four maps from the same blob bytes via the shape-
  tolerant `decode_validator_encryption_keys`. Validators publishing
  under mainnet-v1.1.8 shape contribute only class-groups; post-PR-#1707
  validators contribute the full bundle — matching chain-fallback
  semantics in `sui_syncer::new_committee`.

* Test-only `Committee::new` call sites in `validator_metadata.rs`:
  pass three empty PVSS maps to satisfy the new 8-arg signature.

* Protocol-config snapshots regenerated for v3/v4 (off-chain flag
  flipped on at v4, crypto-v3 active at v4) plus v5 snap files kept
  on disk as forward-looking reference for the commented v5 arm.

Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make `request_add_validator_candidate`, `request_add_validator`, and
`stake_ika` `pub` in `ika-swarm-config::sui_client` so the upcoming
`IkaTestCluster` joiner helper can reuse the battle-tested PTB
builders rather than duplicating them. No behavior change — same
functions, broader visibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ycscaly and others added 7 commits June 9, 2026 08:13
Scenario::with_base_dir keeps node logs after a failure (the default temp dir
is cleaned on drop, which hid the v1.1.8 boot panic). with_epoch_timeout is for
slower heterogeneous runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cross_binary test passes end-to-end (722s): 4 out-of-process validators
boot on a v3-only binary, complete network DKG, then all swap to dev (v3..v4)
and the capability vote advances v3 -> v4 at the next epoch. Exercises the
protocol-vote arithmetic, mid-epoch reconfiguration across the swap, mixed-
committee wire compat, and on-disk compat (restart on new binary, old RocksDB).

Tuning that made it pass: 10-min epochs and swap-all-then-one-transition, to
avoid the known sui_executor gas-coin-contention epoch wedge (short epochs +
swap churn froze the notifier's advance-epoch executor) and to keep each swap
clear of the mid-epoch reconfiguration window. Scenario gains
with_epoch_duration_ms / with_epoch_timeout.

The OLD binary is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto as
dev, differs only in advertised version). The literal mainnet-v1.1.8 ika-node
is crypto-incompatible (inkrypto vs cryptography-private class_groups; v4 key-
shape change) and cannot share a committee with dev — a finding documented in
the test, confirming the dual-pin premise of docs/plan-update-crypto-latest.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
workload.rs: issue_dkg returns the txn digest (completion is confirmed via the
coordinator session counter, not a per-dWallet read); protocol-pp fetch retries
on a partial-TableVec decode error. tests/workload.rs (GREEN) proves the
submission path end-to-end: protocol params from the on-chain network key,
centralized Curve25519 party, coordinator txn executes.

KNOWN GAP (documented in the test + results doc): the coordinator ignores the
emitted event ("not a DWalletSessionEvent"), so the session does not complete —
the driver must call register_encryption_key before the DKG (as the TS SDK
does). Presign/Sign build on a completed DKG and are not implemented.

docs/cross-binary-upgrade-testing-results.md summarizes what was built, the
green go/no-go + cross-binary(v3->v4) runs, the v1.1.8 crypto-incompat finding,
the epoch-wedge tuning, and the workload gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on-chain

tests/workload.rs passes (~415s): a user dWallet completes register-encryption-
key -> DKG(Active) -> presign(verified) -> sign, the sign confirmed on-chain via
the coordinator's user completed_sessions_count. Proves the session-lifecycle
invariant (sessions started in an epoch complete; no silent drops).

The driver orchestrates the canonical `ika` CLI (the tested Rust client) rather
than re-deriving the user-side 2PC. Making it reliably green surfaced real
system properties, all handled:
- dedicated, separately-funded user (faucet SUI + IKA transfer) — sharing the
  publisher key with the notifier causes coin-lock contention;
- register-encryption-key before create (encrypted DKG borrows the user key
  from the coordinator);
- v4 genesis (internal_presign_sessions is a v4 feature);
- 30-min epoch so the lifecycle runs clear of the mid-epoch reconfiguration;
- confirm sign via on-chain completed-count, not the CLI's racy --wait poll.

Adds shared-crypto + fastcrypto deps for the IKA-funding transfer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…it, buffer-stake override

Hardening surfaced while driving a real mainnet-v1.1.8 -> dev rolling swap:

- process.rs: swap now stops the node with SIGTERM (which ika-node's
  `wait_termination` handles for an orderly shutdown) and waits for clean exit,
  with SIGKILL only as a fallback. The previous hard SIGKILL interrupted the
  node mid-consensus-round and left dwallet-MPC replay state partial on disk,
  which crashed the next binary on replay (`consensus round mismatch ...`). A
  binary swap is a planned restart, not a crash, so it must be graceful.

- cluster.rs: document that the epoch counter advancing to N is itself the
  completion signal for epoch N-1 (reconfiguration into a new epoch is gated on
  that epoch's network-key MPC finishing), so callers wait for the epoch *after*
  the work they depend on rather than polling key state.

- scenario.rs / process.rs: add a `set_buffer_stake(bps)` step
  (POST /set-override-buffer-stake) so a quorum, not unanimity, advances the
  protocol version. With n=4 the default 50% buffer rounds up to requiring all
  four votes; a rolling swap can leave one validator's fresh capability
  uncommitted at the boundary tally.

- cross_binary.rs: wait for epoch 2 before swapping (the genesis network DKG
  runs during epoch 1, so epoch 2 guarantees it finished under the old binary),
  drop the buffer stake to a quorum, then wait for epoch 3.
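
The buffer-stake arithmetic behind the `set_buffer_stake(bps)` step can be worked through as below. The formula is an assumption modeled on the Sui-style rule (quorum plus a rounded-up fraction of the remaining stake), and the numbers (total voting power 10,000, quorum 6,667, four validators at 2,500 each) are illustrative, not read from the harness.

```rust
/// Effective stake needed to advance the protocol version.
/// ASSUMPTION: threshold = quorum + ceil((total - quorum) * bps / 10_000).
fn effective_threshold(total: u64, quorum: u64, buffer_stake_bps: u64) -> u64 {
    let f = total - quorum;
    // Integer ceiling division of f * bps by 10_000.
    let buffer = (f * buffer_stake_bps + 9_999) / 10_000;
    quorum + buffer
}

/// True when the supporting stake reaches the effective threshold.
fn vote_passes(supporting: u64, total: u64, quorum: u64, buffer_stake_bps: u64) -> bool {
    supporting >= effective_threshold(total, quorum, buffer_stake_bps)
}
```

Under these assumptions a 50% buffer (5,000 bps) yields a threshold of 8,334, above the 7,500 that three of four validators hold, so all four votes are needed; with the buffer at 0 the threshold drops to the 6,667 quorum and three votes suffice.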

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t; gate internal-presign read on v4

Wire the cross-binary upgrade path against the off-chain-metadata branch so a
mixed-version committee survives a rolling binary swap:

- verify_validator_keys decodes whatever class-groups key shape is on-chain
  (bare mainnet-v1.1.8 `ClassGroupsEncryptionKeyAndProof` or the post-bump
  combined `ValidatorEncryptionKeysAndProofs`) via
  `decode_validator_encryption_keys`, comparing only the class-groups component
  that identifies the seed. PVSS keys are verified off-chain on the assembly path.
- validator_initialization_config publishes the BARE mainnet-v1.1.8 shape
  on-chain (the richer bundle travels off-chain via validator P2P), so a
  v1.1.8 binary can still decode the record during the upgrade window.
- The internal-presign output read is gated on `internal_presign_sessions_enabled()`
  (a v4 feature) so a pre-v4 node mid rolling-upgrade skips the sparse stream
  instead of panicking on the dense per-round assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- bin default scenario `rolling_majority_then_minority` mirrors the proven-good
  config (10-min epochs + 1800s wait timeout + `set_buffer_stake(0)` before the
  upgrade-crossing wait) so the v3->v4 vote can land under n=4.
- `wait_for_epoch` logs a failed `current_epoch` read instead of silently
  treating it as epoch 0 until the deadline.
- Document that the workload sign-completion check (coordinator user
  `completed_sessions_count`) is sound only because the harness drives a single
  dedicated user with one sign in flight.
- results doc: caveat the cross_binary GREEN row (version-only swap; the real
  v1.1.8 crypto-boundary swap is not exercised) and note the single-instance /
  fixed-port (9000/9123) constraint.
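
The `wait_for_epoch` change (log a failed read rather than treat it as epoch 0) might be sketched like this. The signature is hypothetical: the real function reads the epoch through `IkaClient`, whereas here the reader is an injected closure so the loop stays self-contained.

```rust
use std::time::{Duration, Instant};

/// Polls `current_epoch` until it reaches `target` or the timeout elapses.
/// A failed read is logged and retried; it is never interpreted as epoch 0.
fn wait_for_epoch(
    target: u64,
    timeout: Duration,
    poll: Duration,
    mut current_epoch: impl FnMut() -> Result<u64, String>,
) -> Result<u64, String> {
    let deadline = Instant::now() + timeout;
    loop {
        match current_epoch() {
            Ok(epoch) if epoch >= target => return Ok(epoch),
            Ok(_) => {} // not there yet, keep polling
            Err(err) => eprintln!("current_epoch read failed: {err}"),
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out waiting for epoch {target}"));
        }
        std::thread::sleep(poll);
    }
}
```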

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ycscaly changed the base branch from dev to feat/off-chain-metadata-v2 June 9, 2026 08:18
@ycscaly force-pushed the feat/ika-upgrade-test branch from 8672c96 to f858d37 Compare June 9, 2026 08:19
omersadika and others added 20 commits June 9, 2026 11:32
…te + read)

Same fix landed on dev via #1728, applied here on the off-chain-metadata file
structure. The consensus-output replay loop asserts each per-round table's
record round equals the driver round (`dwallet_mpc_messages`). Tables added
after mainnet-v1.1.8 are sparse when a dev binary replays a v1.1.8-written
RocksDB (rolling binary swap), tripping the assertion. Gate write + read of each
on the feature that introduced it:

- internal_presign_sessions: dwallet_internal_mpc_outputs,
  global_presign_requests, idle_status_updates
- noa_checkpoints: verified_system_checkpoint_messages, noa_observations,
  sui_chain_observation_updates

(`network_key_data_messages` is already removed on this branch by the off-chain
work, so it needs no gate here.) Validated by the cross-binary upgrade harness:
mainnet-v1.1.8 -> dev rolling swap reaches protocol v4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	crates/ika-core/src/authority/authority_per_epoch_store.rs
#	crates/ika-core/src/dwallet_mpc/dwallet_mpc_service.rs
…erified handoff cert

Replace the barrier's readiness condition 2. It previously read chain-derived
fields off the `network_keys_receiver` overlay (`current_epoch >= next_epoch`
plus a non-empty `current_reconfiguration_public_output`), which the
`snapshot_ready_for_signing` gate deliberately avoids because the overlay can
surface the prior epoch's output a round behind via the perpetual mirror — so a
non-empty value there does not prove THIS epoch's reconfiguration is local.

Now the barrier decides readiness off the same off-chain signals everything else
trusts: the verified `cur_epoch` handoff cert (the cross-epoch trust anchor) plus
this validator's local reconfiguration-output digest slice. The cert's single
`epoch` scopes the whole handoff, so there is no per-key epoch to check — only
that every `NetworkReconfigurationOutput` item the cert certifies is held locally
with a matching digest (`all_cert_reconfiguration_outputs_held_locally`).
`prepare_handoff_anchor` now returns the cert so the caller reads its items
directly, and the chain-fed `network_keys_receiver` dependency (and the seam
blob-source pre-install that only existed to feed it) are dropped.

Also fix a wedge this exposed: holding the cert does NOT imply holding the
outputs it certifies. A lagging validator can adopt the cert from a buffered
peer-signature quorum without ever computing or caching those outputs, so the
persisted-cert fast path now fetches + caches them too (idempotent) — otherwise
a cert-but-no-outputs validator blocks at the barrier forever, never enters the
epoch, and never publishes its mpc_data.
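
A rough sketch of the readiness check and the fetch list it implies, under stated assumptions: key ids are modeled as `u32`, digests as 32-byte arrays, and `CertItem`, `all_cert_outputs_held_locally`, and `missing_outputs` are illustrative names rather than the signatures used in this branch.

```rust
use std::collections::HashMap;

type Digest = [u8; 32];

/// Hypothetical shape of one reconfiguration output certified by the handoff cert.
struct CertItem {
    key_id: u32,
    output_digest: Digest,
}

/// True only if every certified output is cached locally under the same digest.
/// Holding the cert alone is not enough: an empty cache fails a non-empty cert.
fn all_cert_outputs_held_locally(items: &[CertItem], local: &HashMap<u32, Digest>) -> bool {
    items
        .iter()
        .all(|item| local.get(&item.key_id) == Some(&item.output_digest))
}

/// Key ids whose outputs still have to be fetched and cached before the
/// barrier can pass (the cert-but-no-outputs case).
fn missing_outputs(items: &[CertItem], local: &HashMap<u32, Digest>) -> Vec<u32> {
    items
        .iter()
        .filter(|item| local.get(&item.key_id) != Some(&item.output_digest))
        .map(|item| item.key_id)
        .collect()
}
```

A digest mismatch is treated the same as a missing output, so a stale cached value from a prior epoch does not satisfy the barrier.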

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…c_data set at the epoch boundary

A resource-slow validator would lag the epoch handoff and wedge reconfiguration
in three independent ways. This fixes all three so the boundary converges on the
full committee instead of locking at a bare quorum.

1. Persist the cert from observed signatures, not just from signing.
   `record_handoff_signature` buffered peer signatures until this validator
   computed its own attestation; a validator whose snapshot lagged never did, so
   it never persisted the cert and had to re-fetch its own prior-epoch cert at
   the next boundary. Now, once the buffered peer signatures show a stake-quorum
   agreeing on one attestation (`quorum_attestation_in_buffer`), adopt it and
   persist the cert from the observed quorum (replay re-verifies every signature,
   so a byzantine member can neither forge the cert nor block a real quorum).

2. Freeze the mpc_data input set only when a DKG/reconfiguration actually starts
   AND a quorum is present — not prematurely at epoch start. The freeze used to
   fire on the first ready-signal quorum, on a wall-clock deadline the long
   genesis-DKG transition had already consumed, locking the set at sub-full
   coverage before slower validators' mpc_data propagated. It now fires from the
   DKG/reconfiguration session gate (`freeze_mpc_data_if_quorum`), which a request
   reaches only after the next active committee is published (mid-epoch) — by
   which point coverage is complete, so the frozen set holds every member.

3. Defer the epoch close a configurable number of consensus rounds past the
   EndOfPublish quorum so straggler `EndOfPublishV2` bundles — which carry handoff
   signatures — are sequenced before the epoch closes. The close (factored into
   `build_epoch_close_checkpoint_messages`) now fires at the commit boundary once
   every committee member has voted OR the leader round has advanced
   `end_of_publish_grace_rounds` (new protocol constant, default 50) past quorum.
   Measured as a leader-round delta (rounds skip — not +1 per commit), and the
   anchor round is persisted so a validator restarting mid-grace closes at the
   same round as its peers (the final checkpoint must be deterministic).
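
The close condition in item 3 can be illustrated with a small sketch. `CloseState` and `should_close_epoch` are hypothetical names; only the default of 50 grace rounds and the "all voted OR grace elapsed" rule come from the description above.

```rust
/// Default stated in the commit text for `end_of_publish_grace_rounds`.
const DEFAULT_END_OF_PUBLISH_GRACE_ROUNDS: u64 = 50;

/// Hypothetical state evaluated at each commit boundary.
struct CloseState {
    committee_size: usize,
    votes_seen: usize,
    /// Leader round at which the EndOfPublish quorum was first observed.
    /// Persisted so a validator restarting mid-grace reuses the same anchor.
    quorum_anchor_round: Option<u64>,
}

/// Close once every member has voted, or once the leader round has advanced
/// `grace_rounds` past the quorum anchor. The comparison is a round delta,
/// so skipped leader rounds are handled without counting commits.
fn should_close_epoch(state: &CloseState, leader_round: u64, grace_rounds: u64) -> bool {
    let anchor = match state.quorum_anchor_round {
        Some(round) => round,
        None => return false, // no EndOfPublish quorum yet
    };
    let everyone_voted = state.votes_seen >= state.committee_size;
    let grace_elapsed = leader_round.saturating_sub(anchor) >= grace_rounds;
    everyone_voted || grace_elapsed
}
```

Because the decision depends only on the persisted anchor and the current leader round, two validators evaluating it at the same commit reach the same answer.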

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ore driving

The workload waited for epoch 1 (genesis, reached immediately) and drove the
dWallet lifecycle right away, with a 30-minute epoch. At v4 the genesis
network-key DKG is gated on the off-chain mpc_data freeze, whose ready-signal —
with no next-epoch committee published yet at genesis — only fires at the
3/4*epoch_duration deadline (~22 min on a 30-min epoch). So the network key
wasn't on-chain when the CLI tried to derive protocol parameters, and the
driver's ~2-min retry budget gave up long before.

Wait for epoch 2 instead of 1: the counter advancing to 2 is itself the
completion signal for the genesis network DKG (reconfiguration into epoch 2
reshares that key, which can't happen until the DKG finished), so the key is
guaranteed readable — and don't drive the lifecycle before then, when it could
only fail. Shorten the epoch to 4 min so the freeze deadline (~3 min) is
reachable while still clearing validator bring-up + announcement recording
(~90s) and leaving the lifecycle room inside the next epoch.
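
The timing argument is plain arithmetic and can be checked with a short sketch. The 3/4 factor and the 90-second bring-up figure are taken from the text above; the function names are illustrative.

```rust
use std::time::Duration;

/// Ready-signal deadline at 3/4 of the epoch duration.
fn freeze_deadline(epoch: Duration) -> Duration {
    epoch * 3 / 4
}

/// Time left between validator bring-up and the freeze deadline,
/// or `None` if bring-up alone already overruns the deadline.
fn headroom_after_bringup(epoch: Duration, bringup: Duration) -> Option<Duration> {
    freeze_deadline(epoch).checked_sub(bringup)
}
```

For a 30-minute epoch the deadline lands at 1350 s (22.5 min), far beyond a 2-minute retry budget; for a 4-minute epoch it lands at 180 s, leaving 90 s after a 90 s bring-up.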

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… mid-epoch at the v3->v4 boundary

The off-chain network-key blob overlay is keyed by key ID only, so the
moment this epoch's mid-epoch reconfiguration finalizes locally, the
syncer's merged key data starts carrying the output produced *for the
next epoch's committee* — shares encrypted to next-epoch party IDs,
which need not align with this epoch's (on-chain committee order is not
stable across epochs). In steady-state v4 the cert anchor in
`adopt_cert_verified_keys` rejects it (the prior epoch's handoff cert
pins the output produced FOR the current epoch), but the first v4 epoch
after the v3->v4 upgrade has no prior cert, fell into the cert-less
boundary path, and adopted blindly — every validator then failed
decryption with ClassGroup(Decryption) using this epoch's identity on
next-epoch-dealt shares.

Guard the boundary path the same way the cert anchor does: skip
adoption when the reconfiguration output's digest matches the one this
epoch's own reconfiguration session recorded (epoch-keyed perpetual
digest, new point lookup). The next epoch's manager adopts and decrypts
it with next-epoch identity at epoch start, as in steady state.

Also hoist the last-failed check in the instantiation filter so it
applies to every branch: previously the `Some(prev)` branch re-selected
the failing bytes every poll tick (they differ from the last
*successfully* instantiated ones by definition), re-running a doomed
~18s class-groups decrypt per tick and starving the service loop —
checkpoints (including EndOfPublish) never certified, wedging epoch
advance behind the decryption failure.
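
The hoisted last-failed check might look roughly like this. `InstantiationFilter` and its fields are hypothetical stand-ins for the real filter state; key data is modeled as raw bytes.

```rust
/// Hypothetical instantiation filter state.
struct InstantiationFilter {
    /// Bytes of the last key data that instantiated successfully, if any.
    last_instantiated: Option<Vec<u8>>,
    /// Bytes of the last key data whose decryption failed, if any.
    last_failed: Option<Vec<u8>>,
}

impl InstantiationFilter {
    /// Decides whether `candidate` is worth an (expensive) decrypt attempt.
    fn should_attempt(&self, candidate: &[u8]) -> bool {
        // Hoisted check: applies whether or not anything was instantiated
        // before, so known-bad bytes are never retried on every poll tick.
        if self.last_failed.as_deref() == Some(candidate) {
            return false;
        }
        match &self.last_instantiated {
            Some(prev) => prev.as_slice() != candidate,
            None => true,
        }
    }
}
```

Before the hoist, the equivalent of the `Some(prev)` arm would have returned `true` for the failing bytes, since they differ from the last successful ones.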

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ejections

The stale-gas recovery (drop the cached gas ref, floor the re-fetch at
the rejected version) only ran when the rejection arrived inside
`tx_response.errors`. But the fullnode also rejects at the JSON-RPC
layer (ServerError -32002, "Transaction needs to be rebuilt ... object
unavailable for consumption"), which surfaces as `Err` from
`execute_transaction_block_with_effects` and bailed out before the
recovery code — the cached gas ref survived, so every
`retry_with_max_elapsed_time!` attempt rebuilt the byte-identical stale
tx and re-rejected, wedging checkpoint delivery to Sui for the full
one-hour window (observed: dwallet checkpoints stuck behind a gas coin
advanced by the shared publisher address in the test cluster, blocking
DKG settlement, mid-epoch reconfiguration, and epoch advance).

Factor the recovery into
`NotifierSubmitState::handle_possible_stale_gas_rejection` and apply it
on both paths. Note `IkaError` derives strum's `AsRefStr`, so
`err.as_ref()` yields only the variant name — match the
`SuiClientTxFailureGeneric` payload to get the actual message.
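
A simplified sketch of applying one recovery routine on both rejection paths. `SubmitState`, the marker strings, and the `Result<Vec<String>, String>` shape are assumptions made for illustration; the real `NotifierSubmitState` and Sui SDK response types differ.

```rust
/// Hypothetical stand-in for the notifier's submit state.
struct SubmitState {
    /// Version of the cached gas object reference, if one is cached.
    cached_gas_version: Option<u64>,
    /// Lower bound for the next gas re-fetch.
    min_refetch_version: u64,
}

/// Assumed rejection markers, based on the message quoted above.
const STALE_MARKERS: [&str; 2] = ["needs to be rebuilt", "unavailable for consumption"];

impl SubmitState {
    /// Drops the cached gas ref and floors the re-fetch when `message`
    /// looks like a stale-gas rejection. Returns whether recovery ran.
    fn handle_possible_stale_gas_rejection(&mut self, message: &str) -> bool {
        if !STALE_MARKERS.iter().any(|marker| message.contains(*marker)) {
            return false;
        }
        if let Some(version) = self.cached_gas_version.take() {
            self.min_refetch_version = self.min_refetch_version.max(version);
        }
        true
    }

    /// Runs the recovery on both paths: an RPC-level `Err` and
    /// errors reported inside an otherwise successful response.
    fn on_submit_result(&mut self, result: Result<Vec<String>, String>) -> bool {
        match result {
            Err(rpc_error) => self.handle_possible_stale_gas_rejection(&rpc_error),
            Ok(errors) => errors
                .iter()
                .any(|e| self.handle_possible_stale_gas_rejection(e)),
        }
    }
}
```

The point of the factoring is that a retry loop calling `on_submit_result` no longer rebuilds a byte-identical transaction after an RPC-layer rejection.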

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…grade boundary

Genesis at v3 (MIN) instead of v4 (MAX): a v4 *genesis* network DKG is
rejected forever (PVSS keys only arrive through the next-committee-only
off-chain assembly), so the supported path — and the one mainnet takes —
is genesis v3, then upgrade into v4 via the capability vote. The test
now waits for epoch 2 (v3 genesis DKG + reshare done), zeroes the
buffer stake so the 4-validator vote tallies at bare quorum, waits for
epoch 3, asserts protocol >= v4, and only then drives the
DKG -> Presign -> Sign lifecycle. This exercises the cert-less v3->v4
reconfiguration-adoption boundary fixed in the previous commits.

Remove HANDOFF.md — the reshare-decrypt bug it described is fixed and
the workload test is green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-gas recovery

Clippy's `unnecessary_to_owned` suggests `err.as_ref()` for
`&err.to_string()` here; it compiles because `IkaError` derives strum's
`AsRefStr`, but that returns only the variant name — never the rejection
markers — silently disabling the recovery.
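
The pitfall can be reproduced without strum. In this sketch the hand-written `AsRef<str>` impl mimics a derived variant-name impl; `DemoError` and `is_stale_gas` are illustrative names, not types from this repository.

```rust
use std::fmt;

/// Stand-in error type with a message payload.
enum DemoError {
    TxFailureGeneric(String),
}

/// Mimics a derived variant-name impl: the payload is never exposed.
impl AsRef<str> for DemoError {
    fn as_ref(&self) -> &str {
        match self {
            DemoError::TxFailureGeneric(_) => "TxFailureGeneric",
        }
    }
}

/// `Display` includes the payload, so `to_string()` carries the message.
impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            DemoError::TxFailureGeneric(msg) => write!(f, "tx failure: {}", msg),
        }
    }
}

/// Assumed marker check used by the recovery path.
fn is_stale_gas(text: &str) -> bool {
    text.contains("needs to be rebuilt")
}
```

Both `is_stale_gas(err.as_ref())` and `is_stale_gas(&err.to_string())` compile, but only the second can ever match, which is why the lint suggestion is unsafe to accept here.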

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dators

Internal presign sessions get their sequence number from a single shared
counter, assigned in iteration order over (network key id) x (curve) x
(signature algorithm), and the sequence number is bound into the session
identifier transcript. Both iteration sources were unordered:

- SUPPORTED_CURVES_TO_SIGNATURE_ALGORITHMS_TO_HASH_SCHEMES was a
  HashMap<u32, HashMap<u32, Vec<u32>>> — iteration order is random per
  process (RandomState), so each validator walked curves/algorithms in a
  different order;
- the agreed network key ids were iterated straight off a HashMap.

Each validator therefore derived *different* session identifiers for the
same (curve, algorithm) work. Those sessions could never reach quorum, so
they never completed, and the instantiated != completed gate then blocked
that algorithm's pool top-ups for the entire epoch. Once a user presign
request locked onto the starved pool, the EndOfPublish condition was
unsatisfiable and the epoch could not advance.

Observed live: in a 4-validator run the validators logged three distinct
top-up orders, and exactly the sequence numbers whose (curve, algorithm)
assignment happened to agree on 3+ validators completed — the rest hung
forever, the ECDSA pool stayed empty all epoch, and the run timed out.
A previous green run was a per-process-seed coin flip.

Fix: BTreeMap at both nesting levels of the static, and collect the
agreed key ids into a BTreeSet before the instantiation loop. Pre-existing
bug from the internal sessions instantiation logic (#1638), not specific
to this branch.
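
A minimal sketch of why ordered containers make the assignment deterministic. Key ids are modeled as `u64` and curves, algorithms, and hash schemes as `u32`; `assign_sequence_numbers` is an illustrative function, not the instantiation loop in this branch.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Curve = u32;
type Algorithm = u32;
type HashScheme = u32;

/// Assigns sequence numbers from one shared counter, iterating
/// (network key id) x (curve) x (signature algorithm).
/// `BTreeSet`/`BTreeMap` iterate in sorted key order, so every process
/// derives the same assignment regardless of insertion order.
fn assign_sequence_numbers(
    key_ids: &BTreeSet<u64>,
    supported: &BTreeMap<Curve, BTreeMap<Algorithm, Vec<HashScheme>>>,
) -> Vec<(u64, u64, Curve, Algorithm)> {
    let mut next_sequence = 0u64;
    let mut assigned = Vec::new();
    for &key_id in key_ids {
        for (&curve, algorithms) in supported {
            for &algorithm in algorithms.keys() {
                assigned.push((next_sequence, key_id, curve, algorithm));
                next_sequence += 1;
            }
        }
    }
    assigned
}
```

With `HashMap` in place of either container, two validators could walk the same entries in different orders and bind different sequence numbers into the session identifier for the same work.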

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The reconfiguration overlay (next-epoch network key data computed off-chain
during reconfiguration) was stored bare and adopted based on a
produced-this-epoch digest guard. Instead, store the target epoch alongside
the key data and adopt it only when it matches the epoch actually being
entered — epoch-correct by construction, no guard heuristics.

- validator_metadata.rs: overlay entries carry the epoch they were computed
  for; lookups take the target epoch.
- authority_per_epoch_store.rs / authority_perpetual_tables.rs: persist and
  reload the epoch alongside the overlay data; drop the digest-guard
  plumbing.
- mpc_manager.rs / sui_syncer.rs: pass the target epoch through adoption and
  ignore overlay data for any other epoch.

Validated by a full workload run: genesis at v3, upgrade into v4 at epoch 3,
v3 -> v4 reshare decrypts cleanly on all validators, DKG -> Presign -> Sign
lifecycle green.
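
The epoch-tagged overlay can be sketched as below. `Overlay`, `OverlayEntry`, and the `u32` key id are illustrative assumptions; the actual types in `validator_metadata.rs` are not reproduced here.

```rust
use std::collections::HashMap;

/// Overlay entry tagged with the epoch it was computed for.
struct OverlayEntry {
    target_epoch: u64,
    key_data: Vec<u8>,
}

/// Hypothetical overlay keyed by network key id.
#[derive(Default)]
struct Overlay {
    entries: HashMap<u32, OverlayEntry>,
}

impl Overlay {
    fn store(&mut self, key_id: u32, target_epoch: u64, key_data: Vec<u8>) {
        self.entries.insert(key_id, OverlayEntry { target_epoch, key_data });
    }

    /// Returns key data only when it was computed for the epoch being entered;
    /// data for any other epoch is ignored rather than adopted.
    fn lookup(&self, key_id: u32, entering_epoch: u64) -> Option<&[u8]> {
        self.entries
            .get(&key_id)
            .filter(|entry| entry.target_epoch == entering_epoch)
            .map(|entry| entry.key_data.as_slice())
    }
}
```

Since the epoch travels with the data, the mid-epoch case (next-epoch output visible while still in the current epoch) simply yields `None` without any digest comparison.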

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two upgrades to the cross-binary rolling-upgrade test:

1. Committee changes at every epoch boundary, with a different committee
   size each epoch (4 -> 3 -> 5 -> 4): a validator removal coincides with
   the v3 -> v4 protocol bump, two brand-new validators join via the full
   candidate -> stake -> activate flow (their class-groups keys registered
   on-chain, so the v4 reshare encrypts to parties that never held the
   key), and a final removal reshapes 5 -> 4. Every boundary after the
   first is a real reshare to a different party set.
   - sui_client.rs: request_add_validator / request_remove_validator /
     candidate registration helpers with explicit sender, shared version,
     and cap so the test can drive membership without touching the active
     wallet address.
   - network_config_builder.rs: configurable min_validator_count (the dip
     to 3 is below the protocol default of 4).
   - scenario.rs / cluster.rs / process.rs: join_validator,
     remove_validator, expect_committee_size scenario steps; spawn /
     stop / swap of individual validators on different binaries.

2. Rough per-protocol MPC timing report (mpc_timings.rs): scrape the MPC
   duration metrics from each validator after the v3 (old binary) and v4
   (new binary, churned committee) workload runs, and print a comparison
   table at the end of the run. Informational and flagged, not asserted —
   wall-clock on a loaded developer machine is too noisy to gate on.
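
The committee schedule in item 1 can be modeled with a toy scenario runner. The `Step` variants echo the step names listed above, but the enum, `run_scenario`, and the immediate application of membership changes are simplifications assumed for illustration; in the real harness changes take effect at epoch boundaries.

```rust
use std::collections::BTreeSet;

/// Toy versions of the scenario steps.
enum Step {
    Join(&'static str),
    Remove(&'static str),
    ExpectCommitteeSize(usize),
}

/// Replays the steps against a set of validator names and returns the
/// committee sizes confirmed by each `ExpectCommitteeSize` step.
fn run_scenario(
    initial: &[&'static str],
    min_validator_count: usize,
    steps: &[Step],
) -> Result<Vec<usize>, String> {
    let mut committee: BTreeSet<&'static str> = initial.iter().cloned().collect();
    let mut observed = Vec::new();
    for step in steps {
        match step {
            Step::Join(name) => {
                committee.insert(*name);
            }
            Step::Remove(name) => {
                if committee.len() <= min_validator_count {
                    return Err(format!(
                        "removing {} would drop below min_validator_count {}",
                        name, min_validator_count
                    ));
                }
                committee.remove(name);
            }
            Step::ExpectCommitteeSize(expected) => {
                if committee.len() != *expected {
                    return Err(format!(
                        "expected committee size {}, found {}",
                        expected,
                        committee.len()
                    ));
                }
                observed.push(*expected);
            }
        }
    }
    Ok(observed)
}
```

Running the 4 -> 3 -> 5 -> 4 schedule with a minimum of 3 succeeds, while the same schedule with a minimum of 4 fails at the first removal, which is why the builder needs a configurable `min_validator_count`.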

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…set post-upgrade

The cross-binary churn test wedged at the v3 workload: genesis wrote the
full GlobalPresignConfig, routing ECDSA presigns to the global path, which
is served exclusively from the validators' internal presign pool — and that
pool only fills once internal_presign_sessions activates at protocol v4.
At v3 the presign was unservable forever.

Genesis now takes GenesisGlobalPresignConfig (Full | Empty). Empty is the
mainnet-v1.1.8 on-chain state (the config object must still exist — the
coordinator reads it with a bare dynamic-field borrow). The cross-binary
scenario uses Empty at genesis and a new SetGlobalPresignConfig step right
after the v4 upgrade is confirmed — the same operational ordering a real
mainnet rollout must follow: set_global_presign_config only after v4
activates, or ECDSA presigns stall network-wide until it does.

Existing genesis-at-v4 tests keep Full (exact current behavior). Also
rewords the cross_binary doc comment: the literal v1.1.8 binary failing on
harness genesis is a registration-shape artifact (post-#1707 bundle bytes),
not a production-direction gap — the new binary reads v1.1.8 keys via the
shape-tolerant decode.

Verified in the churn run: v3 workload completed (vs infinite wedge),
v3→v4 vote passed, post-upgrade config set succeeded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le coin

transfer_one_ika took the publisher's first IKA coin — the genesis supply
coin (ika_supply_id) — and transferred it whole to the workload user. The
first churn run to stake a joiner after a workload exposed it: stake_ika
splits the joining stake from ika_supply_id signed by the publisher, which
no longer owned it ("Transaction was not signed by the correct sender").
A second workload on the same cluster would have failed the same way
("publisher owns no IKA").

Pay a fixed 1000-IKA allowance to the workload user instead; the supply
coin stays with the publisher.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s activates

The global-presign pipeline is gated on the internal_presign_sessions
feature flag (protocol v4) on its consensus side, but session intake
diverted every global presign request to the pool unconditionally. Below
v4 that strands the request: no pool to serve it, no MPC session spawned,
and the session is locked into its epoch: `all_current_epoch_sessions_completed`
blocks `advance_epoch`, so the epoch can never end and v4 can
never activate. Mainnet's GlobalPresignConfig is already populated (every
production ECDSA presign routes to global), so a single presign request
in flight after the upgrade restart would have wedged the network at v3
permanently.

Gate the diversion on the same flag: pre-activation, the request falls
through to a user-requested MPC session — the v1.1.8 serving behavior,
whose input (dwallet-output-less presign computation) and output
(RespondDWalletPresign with no dwallet_id, VersionedPresignOutput::V2)
paths are intact on this branch.
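
In sketch form, the gated diversion is a single version check. `PresignRoute` and `route_global_presign` are hypothetical names, and the activation version is hard-coded to 4 only because the text above ties `internal_presign_sessions` to protocol v4; the real code consults a feature flag.

```rust
/// Where a global presign request is served from.
#[derive(Debug, PartialEq, Eq)]
enum PresignRoute {
    /// Served from the validators' internal presign pool.
    InternalPool,
    /// Falls through to a user-requested MPC session (pre-activation behavior).
    UserMpcSession,
}

/// Assumed activation version of `internal_presign_sessions`.
const INTERNAL_PRESIGN_SESSIONS_VERSION: u64 = 4;

/// Diverts to the pool only once the pool can actually exist.
fn route_global_presign(protocol_version: u64) -> PresignRoute {
    if protocol_version >= INTERNAL_PRESIGN_SESSIONS_VERSION {
        PresignRoute::InternalPool
    } else {
        PresignRoute::UserMpcSession
    }
}
```

Gating intake on the same condition as the consensus side means a request can no longer be accepted by one half of the pipeline and ignored by the other.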

Caught by the new v118_upgrade rehearsal: run genesis for a 4-validator committee
on the literal mainnet-v1.1.8 ika-node with the verified mainnet-shape
populated GlobalPresignConfig, run the mainnet user flow at v3 (DKG with
Universal output, global presign as a user session, sign), atomically
swap all validators to the local build, and probe the pre-activation
window with a workload that must complete its global presign at v3 via
the fallback before the boundary. The run then crosses into v4 (the local
binaries reshare the 1.1.8-created network key), serves a pool-backed
global presign, and completes one more clean reshare.

Also corrects GenesisGlobalPresignConfig and cross_binary docs: Full is
the actual verified mainnet on-chain state, Empty is a harness
arrangement (and the only targeted-presign coverage), not the mainnet
shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ith window-delta tables

The scraped MPC duration metrics are cumulative per process, so a later
snapshot blends everything the validator ever ran — a v3-protocol reshare
and a v4-protocol reshare land in one row and the ratio table reads 1.00x.
Add a window table to each comparison: (sum2-sum1)/(completions2-completions1)
between consecutive snapshots isolates just the work done between them
(skipped across a swap, where the counter reset makes the delta negative).

Extend the v118 rehearsal past epoch 4 with two new snapshots:
- v4-reshare: the first reshare executed under the v4 reconfiguration
  math (reconfiguration_message_version = 3, PVSS HPKE) — the epoch 2->3
  reshare still ran the v3 protocol, so the previous run never measured
  v4 reconfiguration cleanly;
- local-v4-settled: a full lifecycle after the pools finished their
  initial fill, pricing v4 DKG / pool presign / sign without the
  boundary work.

Run is green (1105s): the v4-math reshare window prices at 53.2s/7.4s/30.8s/9.6s
per round vs the local binary's v3-math reshare at 9.5s/2.5s/8.6s/2.8s —
with continuous internal-presign pool top-ups sharing the cores.
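
The window computation is small enough to sketch directly. `Snapshot` and `window_mean` are illustrative names, and the sample numbers in the usage note are made up to show the effect, not measurements from the run.

```rust
/// One cumulative scrape of a duration metric.
#[derive(Clone, Copy)]
struct Snapshot {
    /// Cumulative sum of durations, in seconds.
    sum_secs: f64,
    /// Cumulative number of completions.
    completions: u64,
}

/// Mean duration of the work done between two cumulative snapshots:
/// (sum2 - sum1) / (completions2 - completions1).
/// Returns `None` when there were no new completions or the counters went
/// backwards, as happens when the process restarts across a binary swap.
fn window_mean(prev: Snapshot, next: Snapshot) -> Option<f64> {
    if next.completions <= prev.completions || next.sum_secs < prev.sum_secs {
        return None;
    }
    let delta_sum = next.sum_secs - prev.sum_secs;
    let delta_count = (next.completions - prev.completions) as f64;
    Some(delta_sum / delta_count)
}
```

For example, two completions totalling 19.0 s followed by a third taking 53.2 s give a cumulative mean of about 24.1 s, while the window mean isolates the 53.2 s of the latest round.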

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	crates/ika-core/src/dwallet_mpc/mpc_manager.rs
Base automatically changed from feat/off-chain-metadata-v2 to dev June 12, 2026 19:32